Hi,
I am fairly new to data cleaning. I want to update the missing values in the master file for a variable (county code) with values from another dataset. In both files , a corresponding zip code has a corresponding county code. I do not have a unique identifier for the files , so I cannot merge them. The master file has a unique identifier and has about 85,000 unique records for a state . There could be multiple records in the master file who have the same zip code and same county. I want to replace the " missing county value" for a "non-missing zip code". The transaction file has a corresponding county for each unique zip code. Two different zip code could lie in the same county. I need your suggestions for the code to help me do this. Please advice.
Thanks,
DR
Thanks for the suggestion Scott. I am going to try this. Also, I should have mentioned this initially, my master and transaction datasets are SAS datasets. The zip code and county code are not labeled sequentially. But, I will just rename them and use your code.
Can you share part of the two datafiles and an example of how the final output should look like?
Hi , Thanks for your reply. I have attached an excel sheet with a example of the layout of master file. transaction file and output file. The master file is a huge data set with unique ID. It has three different fields each for zip code and county labeled zip1 and county1, zip2 and county2 and zip3 and county3 for three different addresses for the individual. The transaction file has information for the county corresponding to a respective zip code. The transaction file is smaller than the master file. It has two fields zip and county. With the help of the transaction file , I want to assign a county value to missing county1, county2 county3 fields in the master file. Each of the fields zip1, zip2, zip3 and zip have $3 format and county1, county2, county3 and county field have a $5 format. I hope my explanation helps. Please let me know if you have any questions.
How about a HASH TABLE?
DATA MASTER;
INFILE DATALINES DLM=",";
INPUT ID $ ZIP1 COUNTY1 ZIP2 COUNTY2 ZIP3 COUNTY3;
DATALINES;
ER001,30345,145,36751,.,39768,165
ER002,30546,162,37654,.,38765,177
ER003,30357,176,39768,165,35505,154
ER004,30345,.,38765,177,35674,122
ER005,30578,189,35505,154,35555,134
ER006,39820,089,35674,122,34545,59
ER007,39145,121,35555,134,36751,
ER008,36102,063,34545,059,35654,
ER009,36102,063,34545,059,35654,
;
RUN;
DATA TRANSACTION;
INFILE DATALINES DLM="," MISSOVER;
INPUT ZIP COUNTY;
DATALINES;
30345,244
36751,255
37654,233
30345,145
30546,162
30357,176
30578,189
39820,089
39145,121
36102,063
39768,165
38765,177
35505,154
35674,122
35555,134
34545,059
35654,192
;
RUN;
DATA WANT;
IF 0 THEN SET TRANSACTION;;
IF _N_ = 1 THEN DO;
DECLARE HASH H(DATASET:"TRANSACTION");
H.DEFINEKEY("ZIP");
H.DEFINEDATA("ZIP","COUNTY");
H.DEFINEDONE();
END;
DO UNTIL (EOF);
SET MASTER END=EOF;
ARRAY _COUNTY {*} COUNTY1-COUNTY3;
ARRAY _ZIP {*} ZIP1-ZIP3;
DO I = 1 TO 3;
IF _COUNTY{I} = . THEN DO;
RC = H.FIND(KEY:_ZIP{I});
_COUNTY{I} = COUNTY;
END;
END;
OUTPUT;
END;
RUN;
Thanks for the suggestion Scott. I am going to try this. Also, I should have mentioned this initially, my master and transaction datasets are SAS datasets. The zip code and county code are not labeled sequentially. But, I will just rename them and use your code.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.