Solved: Handling misssing values

dr2014 · Posted 07-22-2014 04:21 PM

Hi,

I am fairly new to data cleaning. I want to update the missing values in the master file for a variable (county code) with values from another dataset. In both files , a corresponding zip code has a corresponding county code. I do not have a unique identifier for the files , so I cannot merge them. The master file has a unique identifier and has about 85,000 unique records for a state . There could be multiple records in the master file who have the same zip code and same county. I want to replace the " missing county value" for a "non-missing zip code". The transaction file has a corresponding county for each unique zip code. Two different zip code could lie in the same county. I need your suggestions for the code to help me do this. Please advice.

Thanks,

DR

dr2014 · Posted 07-24-2014 04:15 PM

Thanks for the suggestion Scott. I am going to try this. Also, I should have mentioned this initially, my master and transaction datasets are SAS datasets. The zip code and county code are not labeled sequentially. But, I will just rename them and use your code.

View solution in original post

pronabesh · Posted 07-23-2014 11:27 PM

Can you share part of the two datafiles and an example of how the final output should look like?

dr2014 · Posted 07-24-2014 04:08 AM

Hi , Thanks for your reply. I have attached an excel sheet with a example of the layout of master file. transaction file and output file. The master file is a huge data set with unique ID. It has three different fields each for zip code and county labeled zip1 and county1, zip2 and county2 and zip3 and county3 for three different addresses for the individual. The transaction file has information for the county corresponding to a respective zip code. The transaction file is smaller than the master file. It has two fields zip and county. With the help of the transaction file , I want to assign a county value to missing county1, county2 county3 fields in the master file. Each of the fields zip1, zip2, zip3 and zip have $3 format and county1, county2, county3 and county field have a $5 format. I hope my explanation helps. Please let me know if you have any questions.

Scott_Mitchell · Posted 07-24-2014 05:44 AM

How about a HASH TABLE?

DATA MASTER;

INFILE DATALINES DLM=",";

INPUT ID $ ZIP1 COUNTY1 ZIP2 COUNTY2 ZIP3 COUNTY3;

DATALINES;

ER001,30345,145,36751,.,39768,165

ER002,30546,162,37654,.,38765,177

ER003,30357,176,39768,165,35505,154

ER004,30345,.,38765,177,35674,122

ER005,30578,189,35505,154,35555,134

ER006,39820,089,35674,122,34545,59

ER007,39145,121,35555,134,36751,

ER008,36102,063,34545,059,35654,

ER009,36102,063,34545,059,35654,

;

RUN;

DATA TRANSACTION;

INFILE DATALINES DLM="," MISSOVER;

INPUT ZIP COUNTY;

DATALINES;

30345,244

36751,255

37654,233

30345,145

30546,162

30357,176

30578,189

39820,089

39145,121

36102,063

39768,165

38765,177

35505,154

35674,122

35555,134

34545,059

35654,192

;

RUN;

DATA WANT;

IF 0 THEN SET TRANSACTION;;

IF _N_ = 1 THEN DO;

DECLARE HASH H(DATASET:"TRANSACTION");

H.DEFINEKEY("ZIP");

H.DEFINEDATA("ZIP","COUNTY");

H.DEFINEDONE();

END;

DO UNTIL (EOF);

SET MASTER END=EOF;

ARRAY _COUNTY {*} COUNTY1-COUNTY3;

ARRAY _ZIP {*} ZIP1-ZIP3;

DO I = 1 TO 3;

IF _COUNTY{I} = . THEN DO;

RC = H.FIND(KEY:_ZIP{I});

_COUNTY{I} = COUNTY;

END;

OUTPUT;

END;

RUN;

dr2014 · Posted 07-24-2014 04:15 PM

Thanks for the suggestion Scott. I am going to try this. Also, I should have mentioned this initially, my master and transaction datasets are SAS datasets. The zip code and county code are not labeled sequentially. But, I will just rename them and use your code.

Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Re: Handling misssing values

Registration is open