exclude the observations in both dataset A and B when they have matche...

France · Posted 11-08-2018 03:41 PM

Dear all,

How to exclude the observations in both dataset A and B when they have matched with each other.

For example,

in dataset A

name, number

apple,10

pear,11

pen,12

in dataset B

name, age

apple,5

fight,19

air,7

I would like to create a new dataset C, which includes

name, number, age

apple,10,5

and then exclude observations apple from dataset B and C, like

dataset B

name, number

pear,11

pen,12

dataset C

name, age

fight,19

air,7

Could you please give me some suggestions about this.

thanks in advance

PaigeMiller · Posted 11-08-2018 03:43 PM

Pear started off in data set A, how did it wind up in data set B?

--
Paige Miller

AnnaHawgood · Posted 11-08-2018 03:59 PM

If I have understood your question correctly, you should get the datasets you want (a1, b1 and c) with the the following code:

proc sort data=a;
 by name;
run;

proc sort data=b;
 by name;
run;

data a1 b1 c;
 merge a (in=a) b (in=b);
 by name;
 if a and not b then output a1;
 if b and not a then output b1;
 if a and b then output c;
run;

Reeza · Posted 11-08-2018 03:49 PM

Can you have duplicates within a single file? If not, consider appending the data and using PROC SORT with the UNIQUEKEY and DUPOUT options.

France · Posted 11-09-2018 09:34 AM

Dear Reeza,
thanks for your advice. Could you please explain the meaning in detail for me? Especially, what is the meaning of 'duplicates within a single file' ?

Reeza · Posted 11-09-2018 10:58 AM

1. Append the files - use a data step with INDSNAME option to see the source input file.

2. Read up on PROC SORT - specifcially UNIQUEKEY and NOUNIQUE

3. Use those options to push the duplicate keys to one data set and the others to a different data set.

@France wrote:
Dear Reeza,
thanks for your advice. Could you please explain the meaning in detail for me? Especially, what is the meaning of 'duplicates within a single file' ?

ballardw · Posted 11-08-2018 04:10 PM

I think you start getting your data set names messed up a bit.

This does what you ask, using different final data set names, for your example data.

data a;
   infile datalines dlm=',';
   input name $ number ;
datalines;
apple,10
pear,11
pen,12
;
run;
 
data b;
   infile datalines dlm=',';
   input name $ age ;
datalines;
apple,5
fight,19
air,7
;
run;

proc sort data=a;
  by name;
run;

proc sort data=b;
  by name;
run;

data n1 n2 (drop=age) n3 (drop=number);
   merge a (in=ina)
         b( in=inb )
   ;
   by name;
   if ina and inb then output n1;
   else if ina then output n2;
   else if inb then output n3;
run;

BTW almost any operation that reads dataset named XXX and writes a modified version back to XXX is very dangerous as you replace the existing data set and lose the starting data.

France · Posted 11-11-2018 12:31 PM

Dear ballardw,

thank you for your advice. I am using the following codes,

 
data sa_step6.Distinct_name_ori1;
  set Result.Distinct_name;
  rename psn_name=name;
run;

proc sort data=step1.WSCOPEUK;
  by name;
run;

proc sort data=sa_step6.Distinct_name_ori1;
  by name;
run;

data SA_STEP5.WSCOPEUK_TRY1 sa_step6.Distinct_name_ori2 sa_step6.WSCOPEUK_ori2 ;
   merge sa_step6.Distinct_name_ori1 (in=ina)
         step1.WSCOPEUK (in=ina)
   ;
   by name;
   if ina and inb then output SA_STEP5.WSCOPEUK_TRY1;
   else if ina then output sa_step6.Distinct_name_ori2;
   else if inb then output sa_step6.WSCOPEUK_ori2;
run;

however, I get the following result.

858
859  data sa_step6.Distinct_name_ori1;
860    set Result.Distinct_name;
861    rename psn_name=name;
862  run;

NOTE: There were 881490 observations read from the data set RESULT.DISTINCT_NAME.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI1 has 881490 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           39.55 seconds
      cpu time            2.35 seconds


863
864  proc sort data=step1.WSCOPEUK;
865    by name;
866  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


867
868  proc sort data=sa_step6.Distinct_name_ori1;
869    by name;
870  run;

NOTE: There were 881490 observations read from the data set SA_STEP6.DISTINCT_NAME_ORI1.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI1 has 881490 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           21.02 seconds
      cpu time            4.84 seconds


871
872  data SA_STEP5.WSCOPEUK_TRY1 sa_step6.Distinct_name_ori2 sa_step6.WSCOPEUK_ori2 ;
873     merge sa_step6.Distinct_name_ori1 (in=ina)
874           step1.WSCOPEUK (in=ina)
875     ;
876     by name;
877     if ina and inb then output SA_STEP5.WSCOPEUK_TRY1;
878     else if ina then output sa_step6.Distinct_name_ori2;
879     else if inb then output sa_step6.WSCOPEUK_ori2;
880  run;

NOTE: Variable inb is uninitialized.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 881490 observations read from the data set SA_STEP6.DISTINCT_NAME_ORI1.
NOTE: There were 5664 observations read from the data set STEP1.WSCOPEUK.
NOTE: The data set SA_STEP5.WSCOPEUK_TRY1 has 0 observations and 15 variables.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI2 has 887001 observations and 15 variables.
NOTE: The data set SA_STEP6.WSCOPEUK_ORI2 has 0 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           35.96 seconds
      cpu time            2.79 seconds

could you please give me some suggestion to fix this problem.

thanks in advanece

SuzanneDorinski · Posted 11-11-2018 01:29 PM

You used in=ina for both your data sets on the merge statement.

   merge sa_step6.Distinct_name_ori1 (in=ina)
         step1.WSCOPEUK (in=ina)

Change one of them to inb. I can't say at a glance which one you want to be inb.

That's why you have the note in the log that inb is uninitialized.

exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Re: exclude the observations in both dataset A and B when they have matched with each other

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!