topic Re: Eliminating duplicate records across three datasets in SAS Programming

Eliminating duplicate records across three datasets

HB — Tue, 25 Jul 2017 15:58:50 GMT

I have three datasets representing people by year.They should be unique. If a person in in year1, I don't want them in year2 or year3. If in year2, not in 1 or 3, if in 3 not in 1 or 2.

Unfortunately in the real world, people manage to cross neat dataset boundaries and show up wherever.

I created a field in each record to mark the year of the record and did this:

proc sort data=fy07_data;
	by ssn;
run;

proc sort data=fy08_data;
	by ssn;
run;

proc sort data=fy09_data;
	by ssn;
run;

data data_07_09;
	merge 	        fy07_data (in=my07) 
			fy08_data (in=my08)
			fy09_data (in=my09);
	by ssn;
run;
* I was going to do something with the in's and didn't; 

data data_07_09_final;
	set data_07_09;
	if myyear07flag = 1 and myyear08flag = 1 then delete;
	if myyear07flag = 1 and myyear09flag = 1 then delete;
	if myyear08flag = 1 and myyear09flag = 1 then delete;
	myawardyear = '00';
	if myyear07flag = 1 then myawardyear = '07';
	if myyear08flag = 1 then myawardyear = '08';
	if myyear09flag = 1 then myawardyear = '09';
run;

after which I drop the myyearXXflag fields befroe outputting the dataset. This seems somehow inelegant. It just feels like the process could be terser and cleaner.

Is there a better way to accomplish this or is this as good as anything?

Thanks.

Re: Eliminating duplicate records across three datasets

Reeza — Tue, 25 Jul 2017 15:57:38 GMT

I'd probably stack them together, sorting by year, removing duplicates using PROC SORT NODUPKEY and then transposing if desired. Not sure there's a reason to at this point. You don't have any other variables in the dataset? If they had the same variables they'd be overwritten.

data combined;
set fy07_data fy08_data fy09_data indsname=source;
source_name = source;
run;

proc sort data=combined; 
by id source_name;
run;

proc sort data=combined nodupkey;
by id;
run;

*transpose if desired to 'wide' format, but assuming the data is the same what happens to the other variables?;

Re: Eliminating duplicate records across three datasets

HB — Tue, 25 Jul 2017 16:35:36 GMT

The indsname option is new to me and super cool, but this doesn't do what I want. I want to delete records common between the tables.

So with

data fy07_data;
   input ssn;
   datalines;
11
12
13
14
15
16
17
18
21
22
31
32
;

data fy08_data;
   input ssn;
   datalines;
11
12
21
22
23
24
25
26
27
28
33
34
;

data fy09_data;
   input ssn;
   datalines;
13
14
21
22
31
32
33
34
35
36
37
38
;


data combined;
set fy07_data fy08_data fy09_data indsname=source;
source_name = source;
run;

proc sort data=combined; 
by source_name ssn;
run;

proc sort data=combined nodupkey;
by ssn;
run;


proc print data=combined noobs; 
 var ssn source_name;
run;

I get a data set of

                                     ssn     source_name

                                      11    WORK.FY07_DATA
                                      12    WORK.FY07_DATA
                                      13    WORK.FY07_DATA
                                      14    WORK.FY07_DATA
                                      15    WORK.FY07_DATA
                                      16    WORK.FY07_DATA
                                      17    WORK.FY07_DATA
                                      18    WORK.FY07_DATA
                                      21    WORK.FY07_DATA
                                      22    WORK.FY07_DATA
                                      23    WORK.FY08_DATA
                                      24    WORK.FY08_DATA
                                      25    WORK.FY08_DATA
                                      26    WORK.FY08_DATA
                                      27    WORK.FY08_DATA
                                      28    WORK.FY08_DATA
                                      31    WORK.FY07_DATA
                                      32    WORK.FY07_DATA
                                      33    WORK.FY08_DATA
                                      34    WORK.FY08_DATA
                                      35    WORK.FY09_DATA
                                      36    WORK.FY09_DATA
                                      37    WORK.FY09_DATA
                                      38    WORK.FY09_DATA

I want a data set of

                                     ssn     source_name

                                      15    WORK.FY07_DATA
                                      16    WORK.FY07_DATA
                                      17    WORK.FY07_DATA
                                      18    WORK.FY07_DATA
                                      23    WORK.FY08_DATA
                                      24    WORK.FY08_DATA
                                      25    WORK.FY08_DATA
                                      26    WORK.FY08_DATA
                                      27    WORK.FY08_DATA
                                      28    WORK.FY08_DATA
                                      35    WORK.FY09_DATA
                                      36    WORK.FY09_DATA
                                      37    WORK.FY09_DATA
                                      38    WORK.FY09_DATA

Re: Eliminating duplicate records across three datasets

Ksharp — Wed, 26 Jul 2017 12:23:41 GMT

data fy07_data;
   input ssn;
   datalines;
11
12
13
14
15
16
17
18
21
22
31
32
;

data fy08_data;
   input ssn;
   datalines;
11
12
21
22
23
24
25
26
27
28
33
34
;

data fy09_data;
   input ssn;
   datalines;
13
14
21
22
31
32
33
34
35
36
37
38
;


data combined;
set fy07_data fy08_data fy09_data indsname=source;
source_name = source;
run;
proc sql;
select distinct source_name,ssn
 from combined
  group by ssn 
   having count(distinct source_name)=1;
quit;

Re: Eliminating duplicate records across three datasets

Ksharp — Wed, 26 Jul 2017 12:31:39 GMT


proc sql;
(
select * from fy07_data 
except 
(select * from fy08_data union select * from fy09_data )
)

union

(
select * from  fy08_data
except 
(select * from fy07_data union select * from fy09_data )
)

union

(
select * from  fy09_data
except 
(select * from fy08_data union select * from fy07_data )
)
;
quit;

Re: Eliminating duplicate records across three datasets

HB — Mon, 31 Jul 2017 15:11:25 GMT

Okay, with that much unioning and excepting going on, I feel better about my

        if myyear07flag = 1 and myyear08flag = 1 then delete;
	if myyear07flag = 1 and myyear09flag = 1 then delete;

stuff. Lol.