DATA Step, Macro, Functions and more

Eliminating duplicate records across three datasets

Reply
Regular Contributor
Regular Contributor
Posts: 162

Eliminating duplicate records across three datasets

[ Edited ]

I have three datasets representing people by year.They should be unique.  If a person in in year1, I don't want them in year2 or year3.  If in year2, not in 1 or 3, if in 3 not in 1 or 2. 

 

Unfortunately in the real world, people manage to cross neat dataset boundaries and show up wherever.

 

I created a field in each record to mark the year of the record and did this:

 

proc sort data=fy07_data;
	by ssn;
run;

proc sort data=fy08_data;
	by ssn;
run;

proc sort data=fy09_data;
	by ssn;
run;

data data_07_09;
	merge 	        fy07_data (in=my07) 
			fy08_data (in=my08)
			fy09_data (in=my09);
	by ssn;
run;
* I was going to do something with the in's and didn't; 
data data_07_09_final; set data_07_09; if myyear07flag = 1 and myyear08flag = 1 then delete; if myyear07flag = 1 and myyear09flag = 1 then delete; if myyear08flag = 1 and myyear09flag = 1 then delete; myawardyear = '00'; if myyear07flag = 1 then myawardyear = '07'; if myyear08flag = 1 then myawardyear = '08'; if myyear09flag = 1 then myawardyear = '09'; run;

after which I drop the myyearXXflag fields befroe outputting the dataset.  This seems somehow inelegant.  It just feels like the process could be terser and cleaner.

 

Is there a better way to accomplish this or is this as good as anything?

 

Thanks.

Super User
Posts: 19,861

Re: Eliminating duplicate records across three datasets

I'd probably stack them together, sorting by year, removing duplicates using PROC SORT NODUPKEY and then transposing if desired. Not sure there's a reason to at this point. You don't have any other variables in the dataset? If they had the same variables they'd be overwritten. 

 

data combined;
set fy07_data fy08_data fy09_data indsname=source;
source_name = source;
run;

proc sort data=combined; 
by id source_name;
run;

proc sort data=combined nodupkey;
by id;
run;

*transpose if desired to 'wide' format, but assuming the data is the same what happens to the other variables?;
Regular Contributor
Regular Contributor
Posts: 162

Re: Eliminating duplicate records across three datasets

The indsname option is new to me and super cool, but this doesn't do what I want. I want to delete records common between the tables.

 

So with

data fy07_data;
   input ssn;
   datalines;
11
12
13
14
15
16
17
18
21
22
31
32
;

data fy08_data;
   input ssn;
   datalines;
11
12
21
22
23
24
25
26
27
28
33
34
;

data fy09_data;
   input ssn;
   datalines;
13
14
21
22
31
32
33
34
35
36
37
38
;


data combined;
set fy07_data fy08_data fy09_data indsname=source;
source_name = source;
run;

proc sort data=combined; 
by source_name ssn;
run;

proc sort data=combined nodupkey;
by ssn;
run;


proc print data=combined noobs; 
 var ssn source_name;
run;

I get a data set of

                                     ssn     source_name

                                      11    WORK.FY07_DATA
                                      12    WORK.FY07_DATA
                                      13    WORK.FY07_DATA
                                      14    WORK.FY07_DATA
                                      15    WORK.FY07_DATA
                                      16    WORK.FY07_DATA
                                      17    WORK.FY07_DATA
                                      18    WORK.FY07_DATA
                                      21    WORK.FY07_DATA
                                      22    WORK.FY07_DATA
                                      23    WORK.FY08_DATA
                                      24    WORK.FY08_DATA
                                      25    WORK.FY08_DATA
                                      26    WORK.FY08_DATA
                                      27    WORK.FY08_DATA
                                      28    WORK.FY08_DATA
                                      31    WORK.FY07_DATA
                                      32    WORK.FY07_DATA
                                      33    WORK.FY08_DATA
                                      34    WORK.FY08_DATA
                                      35    WORK.FY09_DATA
                                      36    WORK.FY09_DATA
                                      37    WORK.FY09_DATA
                                      38    WORK.FY09_DATA

I want a data set of

                                     ssn     source_name

                                      15    WORK.FY07_DATA
                                      16    WORK.FY07_DATA
                                      17    WORK.FY07_DATA
                                      18    WORK.FY07_DATA
                                      23    WORK.FY08_DATA
                                      24    WORK.FY08_DATA
                                      25    WORK.FY08_DATA
                                      26    WORK.FY08_DATA
                                      27    WORK.FY08_DATA
                                      28    WORK.FY08_DATA
                                      35    WORK.FY09_DATA
                                      36    WORK.FY09_DATA
                                      37    WORK.FY09_DATA
                                      38    WORK.FY09_DATA
Super User
Posts: 10,044

Re: Eliminating duplicate records across three datasets

data fy07_data;
   input ssn;
   datalines;
11
12
13
14
15
16
17
18
21
22
31
32
;

data fy08_data;
   input ssn;
   datalines;
11
12
21
22
23
24
25
26
27
28
33
34
;

data fy09_data;
   input ssn;
   datalines;
13
14
21
22
31
32
33
34
35
36
37
38
;


data combined;
set fy07_data fy08_data fy09_data indsname=source;
source_name = source;
run;
proc sql;
select distinct source_name,ssn
 from combined
  group by ssn 
   having count(distinct source_name)=1;
quit;
Super User
Posts: 10,044

Re: Eliminating duplicate records across three datasets


proc sql;
(
select * from fy07_data 
except 
(select * from fy08_data union select * from fy09_data )
)

union

(
select * from  fy08_data
except 
(select * from fy07_data union select * from fy09_data )
)

union

(
select * from  fy09_data
except 
(select * from fy08_data union select * from fy07_data )
)
;
quit;
Regular Contributor
Regular Contributor
Posts: 162

Re: Eliminating duplicate records across three datasets

Okay, with that much unioning and excepting going on, I feel better about my

        if myyear07flag = 1 and myyear08flag = 1 then delete;
	if myyear07flag = 1 and myyear09flag = 1 then delete; 

stuff.  Lol.

Ask a Question
Discussion stats
  • 5 replies
  • 146 views
  • 1 like
  • 3 in conversation