This topic is solved and locked.
tpt1
Calcite | Level 5

Hi-

I have 3 datasets that are created using PROC SURVEYSELECT and then appended using a data step.  The 3 datasets may contain duplicate IDs across them.  Duplication across the 3 datasets is expected and is not an issue, but I do not want records with duplicate IDs in the appended file.  The appended file cannot carry any extra records that the end user could substitute when a duplicate record is found.

 

My first thought is to use a DO UNTIL-style loop that reruns the 3 PROC SURVEYSELECT steps until the appended file has no duplication, but this seems cumbersome.  Can anyone suggest an efficient way to append multiple datasets and guarantee no duplication?  I have included an abridged version of my code below:

proc surveyselect data=set_1 sampsize=30 out=set_1_sample;

RUN;

proc surveyselect data=set_2 sampsize=27 out=set_2_sample;

RUN;

proc surveyselect data=set_3 sampsize=45 out=set_3_sample;

RUN;

 

data append_set;

set set_1_sample set_2_sample set_3_sample;

run;

 

proc sort data=append_set ;

by ID;

run;

 

/* check for duplicates */

data dups;

set append_set;

by ID;

if not (first.ID and last.ID) then output;

run;

 

/* identify the count of duplicates */

proc sql noprint;

select count(*) into :dup_count

from dups;

quit;

%put &dup_count;

 

/* I would like a DO UNTIL here that reruns the code above until dup_count = 0,
   but DO UNTIL is a data step statement and cannot wrap PROC steps in open code */

DO UNTIL (&dup_count = 0);  /* pseudocode */
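For reference, looping over PROC steps in open code requires the macro language (%DO %UNTIL inside a macro). A minimal sketch of that approach, reusing the datasets and sample sizes from the code above (the macro name %resample is just an illustration), might look like this:

```
/* Sketch only: rerun the sampling until no duplicate IDs remain.
   Assumes set_1, set_2, set_3 and the ID variable from the code above. */
%macro resample;
   %let dup_count = 1;
   %do %until (&dup_count = 0);
      proc surveyselect data=set_1 sampsize=30 out=set_1_sample; run;
      proc surveyselect data=set_2 sampsize=27 out=set_2_sample; run;
      proc surveyselect data=set_3 sampsize=45 out=set_3_sample; run;

      data append_set;
         set set_1_sample set_2_sample set_3_sample;
      run;

      /* count IDs that occur more than once */
      proc sql noprint;
         select count(*) into :dup_count
         from (select ID from append_set
               group by ID
               having count(*) > 1);
      quit;
   %end;
%mend resample;
%resample
```

Note that when the overlap between the source sets is large, this loop could run for a long time (or never terminate), which is one reason to prefer removing the duplicates before sampling.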

 

 

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

I think it would be best to consider all the data at once, eliminate duplicate ids at random and select your sample from that. Here is a simple way to do this, building on code provided by @ballardw:

 

data work.set1;
   do id = 1 to 1000;
      if rand('uniform') < .3 then output;
   end;
run;
data work.set2;
   do id = 1 to 1000;
      if rand('uniform') < .3 then output;
   end;
run;
data work.set3;
   do id = 1 to 1000;
      if rand('uniform') < .3 then output;
   end;
run;

data all;
   length dataname $8.;
   set set1 set2 set3 indsname=dsn;
   dataname = scan(dsn,2);
   rnd = rand("uniform");
run;

proc sort data=all; by id rnd; run;

data unique;
set all; by id;
if first.id;
drop rnd;
run;

proc sort data=unique; by dataname; run;

proc surveyselect 
    data = unique
    sampsize=(30 27 45)
    out = samples;
strata dataname;
run;
PG


7 REPLIES 7
tpt1
Calcite | Level 5
I should clarify...I don't want BOTH records with duplicate IDs. One of the records in the append_set is fine.
VDD
Ammonite | Level 13 VDD
Ammonite | Level 13

try this

 

data dups;
set append_set; 
by ID; 
if not first.ID and not last.ID then output;
run;
tpt1
Calcite | Level 5

Thanks VDD, but I still have duplicate IDs in my append_set.  I don't see another way to remove them without reducing the required population size (which in this example is 102).

ballardw
Super User

I am a little concerned that the approach you have been following is going to actually yield incorrect selection probabilities and weights.

Consider if set_2 has 100 records and I select 10 with SRS. The probability of selection is .1 and weight would be 10.

But if 2 of them are duplicates and you remove them, the remaining records still carry a weight of 10 when they should actually carry 12.5 (100 / 8) if they are to represent the population of set_2.

 

And worse by compounding with the third set.

 

Perhaps you can describe the populations that set_1, set_2 and set_3 represent and why there are duplicates?

 

I might consider scrubbing the source data sets to identify the duplicate persons and randomly remove them from the other sets so the person only appears once across the three source sets.

 

Something like this:

data work.set1;
   do id = 1 to 1000;
      if rand('uniform') < .3 then output;
   end;
run;
data work.set2;
   do id = 1 to 1000;
      if rand('uniform') < .3 then output;
   end;
run;
data work.set3;
   do id = 1 to 1000;
      if rand('uniform') < .3 then output;
   end;
run;

data work.all;
   length dataname $30.;
   set work.set1 work.set2 work.set3 indsname=dsn;
   dataname = dsn;
run;

proc sort data=work.all;
   by id;
run;

proc transpose data=work.all out=work.trans;
   by id;
   var dataname;
run;

data work.choose;
   set work.trans;
   array c col: ;
   str= catx('|',of c(*));
   if countw(str,'|')=2 then do;
      set = c[rand('table',.5,.5)];
   end;
   else if countw(str,'|')=3 then do;
      set = c[rand('table',1/3,1/3,1/3)];
   end;
   else set=col1;
   keep id set;
run;

proc sort data=work.choose;
   by set;
run;

proc surveyselect data=work.choose
   sampsize=(30 27 45)
   out = work.selected
;
   strata set;
run;


You would have to merge the result of working with just the ids back to the original data for the other variables, but that should not be difficult.
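The merge back to the original data could be sketched like this (the dataset and variable names are carried over from the steps above, and it assumes the dataname values pass through the transpose unchanged):

```
/* Sketch: recover the other variables for the selected ids.
   work.selected and work.all come from the steps above. */
proc sql;
   create table work.final as
   select a.*
   from work.all as a
        inner join work.selected as s
        on a.id = s.id
           and a.dataname = s.set;
quit;
```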

 

johnsville
Obsidian | Level 7

Not sure if I understand all your requirements, and note that I did NOT actually run the attached code, but I do stuff like this all the time. Maybe what you are looking for is accomplished like so:

 

/*
 Assumptions: one and only one observation from each set of duplicates
 (those having the same ID value) is to be kept in the SINGLES dataset;
 observations whose ID value occurs exactly once are also kept in SINGLES;
 and only records from dupe-sets of more than one observation per ID
 are kept in the MULTIS dataset.
*/
data append_set;

set set_1_sample set_2_sample set_3_sample;
run;

/*
 if there is some variable that might be used to
 prioritize observations within a single ID,
 add those to the order by clause here;
 this would cause the most desirable observations to
 come out first.
*/

proc sql noprint ;
	create view append_set_v as
	select *
	from append_set
	order by id
;
quit ;

/*
 dataset singles will have the first observation encountered within 
 a set of duplicates within ID, along with all true singles;
 dataset multis will have only dupe-sets of GT 1 observation.

 At end, inside the multis dataset, the max seq value within ID will
 contain the count of records having that ID.
*/

data singles multis;
   retain seq 0;
   set append_set_v;
   by ID;
   if first.ID then do;
      seq = 1;
      output singles;
   end;
   if not (first.ID and last.ID) then do;
      output multis;
      seq + 1;
   end;
run;
tpt1
Calcite | Level 5

PG Stats-

Yes, removing duplicates at random makes good sense.  Thanks so much to everyone for the great feedback!


Discussion stats
  • 7 replies
  • 1657 views
  • 1 like
  • 5 in conversation