BookmarkSubscribeRSS Feed
France
Quartz | Level 8

Dear all,

 

How to exclude the observations in both dataset A and B when they have matched with each other.

 

For example, 

in dataset A

 

name, number 

apple,10

pear,11

pen,12

 

in dataset B

 

name, age 

apple,5

fight,19

air,7

 

I would like to create a new dataset C, which includes

name, number, age 

apple,10,5

 

and then exclude observations apple from dataset B and C, like

 

dataset B

name, number

pear,11

pen,12

 

dataset C

name, age 

fight,19

air,7

 

Could you please give me some suggestions about this.

thanks in advance

 

 

 

8 REPLIES 8
PaigeMiller
Diamond | Level 26

Pear started off in data set A, how did it wind up in data set B?

--
Paige Miller
AnnaHawgood
Fluorite | Level 6

If I have understood your question correctly, you should get the datasets you want (a1, b1 and c) with the the following code:

 

proc sort data=a;
 by name;
run;

proc sort data=b;
 by name;
run;

data a1 b1 c;
 merge a (in=a) b (in=b);
 by name;
 if a and not b then output a1;
 if b and not a then output b1;
 if a and b then output c;
run;

Reeza
Super User
Can you have duplicates within a single file? If not, consider appending the data and using PROC SORT with the UNIQUEKEY and DUPOUT options.
France
Quartz | Level 8
Dear Reeza,
thanks for your advice. Could you please explain the meaning in detail for me? Especially, what is the meaning of 'duplicates within a single file' ?
Reeza
Super User

1. Append the files - use a data step with INDSNAME option to see the source input file.

2. Read up on PROC SORT - specifcially UNIQUEKEY and NOUNIQUE

3. Use those options to push the duplicate keys to one data set and the others to a different data set. 

 


@France wrote:
Dear Reeza,
thanks for your advice. Could you please explain the meaning in detail for me? Especially, what is the meaning of 'duplicates within a single file' ?

 

ballardw
Super User

I think you start getting your data set names messed up a bit.

This does what you ask, using different final data set names, for your example data.

data a;
   infile datalines dlm=',';
   input name $ number ;
datalines;
apple,10
pear,11
pen,12
;
run;
 
data b;
   infile datalines dlm=',';
   input name $ age ;
datalines;
apple,5
fight,19
air,7
;
run;

proc sort data=a;
  by name;
run;

proc sort data=b;
  by name;
run;

data n1 n2 (drop=age) n3 (drop=number);
   merge a (in=ina)
         b( in=inb )
   ;
   by name;
   if ina and inb then output n1;
   else if ina then output n2;
   else if inb then output n3;
run;

BTW almost any operation that reads dataset named XXX and writes a modified version back to XXX is very dangerous as you replace the existing data set and lose the starting data.

 

France
Quartz | Level 8

Dear ballardw,

 

thank you for your advice. I am using the following codes,

 
data sa_step6.Distinct_name_ori1;
  set Result.Distinct_name;
  rename psn_name=name;
run;

proc sort data=step1.WSCOPEUK;
  by name;
run;

proc sort data=sa_step6.Distinct_name_ori1;
  by name;
run;

data SA_STEP5.WSCOPEUK_TRY1 sa_step6.Distinct_name_ori2 sa_step6.WSCOPEUK_ori2 ;
   merge sa_step6.Distinct_name_ori1 (in=ina)
         step1.WSCOPEUK (in=ina)
   ;
   by name;
   if ina and inb then output SA_STEP5.WSCOPEUK_TRY1;
   else if ina then output sa_step6.Distinct_name_ori2;
   else if inb then output sa_step6.WSCOPEUK_ori2;
run;

however, I get the following result.

 

858
859  data sa_step6.Distinct_name_ori1;
860    set Result.Distinct_name;
861    rename psn_name=name;
862  run;

NOTE: There were 881490 observations read from the data set RESULT.DISTINCT_NAME.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI1 has 881490 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           39.55 seconds
      cpu time            2.35 seconds


863
864  proc sort data=step1.WSCOPEUK;
865    by name;
866  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds


867
868  proc sort data=sa_step6.Distinct_name_ori1;
869    by name;
870  run;

NOTE: There were 881490 observations read from the data set SA_STEP6.DISTINCT_NAME_ORI1.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI1 has 881490 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           21.02 seconds
      cpu time            4.84 seconds


871
872  data SA_STEP5.WSCOPEUK_TRY1 sa_step6.Distinct_name_ori2 sa_step6.WSCOPEUK_ori2 ;
873     merge sa_step6.Distinct_name_ori1 (in=ina)
874           step1.WSCOPEUK (in=ina)
875     ;
876     by name;
877     if ina and inb then output SA_STEP5.WSCOPEUK_TRY1;
878     else if ina then output sa_step6.Distinct_name_ori2;
879     else if inb then output sa_step6.WSCOPEUK_ori2;
880  run;

NOTE: Variable inb is uninitialized.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 881490 observations read from the data set SA_STEP6.DISTINCT_NAME_ORI1.
NOTE: There were 5664 observations read from the data set STEP1.WSCOPEUK.
NOTE: The data set SA_STEP5.WSCOPEUK_TRY1 has 0 observations and 15 variables.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI2 has 887001 observations and 15 variables.
NOTE: The data set SA_STEP6.WSCOPEUK_ORI2 has 0 observations and 15 variables.
NOTE: DATA statement used (Total process time):
      real time           35.96 seconds
      cpu time            2.79 seconds


could you please give me some suggestion to fix this problem.

 

thanks in advanece

SuzanneDorinski
Lapis Lazuli | Level 10

You used in=ina for both your data sets on the merge statement.

 

   merge sa_step6.Distinct_name_ori1 (in=ina)
         step1.WSCOPEUK (in=ina)

Change one of them to inb.  I can't say at a glance which one  you want to be inb.  

 

That's why you have the note in the log that inb is uninitialized.

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 4838 views
  • 3 likes
  • 6 in conversation