Dear all,
How to exclude the observations in both dataset A and B when they have matched with each other.
For example,
in dataset A
name, number
apple,10
pear,11
pen,12
in dataset B
name, age
apple,5
fight,19
air,7
I would like to create a new dataset C, which includes
name, number, age
apple,10,5
and then exclude observations apple from dataset B and C, like
dataset B
name, number
pear,11
pen,12
dataset C
name, age
fight,19
air,7
Could you please give me some suggestions about this.
thanks in advance
Pear started off in data set A, how did it wind up in data set B?
If I have understood your question correctly, you should get the datasets you want (a1, b1 and c) with the the following code:
proc sort data=a; by name; run; proc sort data=b; by name; run; data a1 b1 c; merge a (in=a) b (in=b); by name; if a and not b then output a1; if b and not a then output b1; if a and b then output c; run;
1. Append the files - use a data step with INDSNAME option to see the source input file.
2. Read up on PROC SORT - specifcially UNIQUEKEY and NOUNIQUE
3. Use those options to push the duplicate keys to one data set and the others to a different data set.
@France wrote:
Dear Reeza,
thanks for your advice. Could you please explain the meaning in detail for me? Especially, what is the meaning of 'duplicates within a single file' ?
I think you start getting your data set names messed up a bit.
This does what you ask, using different final data set names, for your example data.
data a; infile datalines dlm=','; input name $ number ; datalines; apple,10 pear,11 pen,12 ; run; data b; infile datalines dlm=','; input name $ age ; datalines; apple,5 fight,19 air,7 ; run; proc sort data=a; by name; run; proc sort data=b; by name; run; data n1 n2 (drop=age) n3 (drop=number); merge a (in=ina) b( in=inb ) ; by name; if ina and inb then output n1; else if ina then output n2; else if inb then output n3; run;
BTW almost any operation that reads dataset named XXX and writes a modified version back to XXX is very dangerous as you replace the existing data set and lose the starting data.
Dear ballardw,
thank you for your advice. I am using the following codes,
data sa_step6.Distinct_name_ori1;
set Result.Distinct_name;
rename psn_name=name;
run;
proc sort data=step1.WSCOPEUK;
by name;
run;
proc sort data=sa_step6.Distinct_name_ori1;
by name;
run;
data SA_STEP5.WSCOPEUK_TRY1 sa_step6.Distinct_name_ori2 sa_step6.WSCOPEUK_ori2 ;
merge sa_step6.Distinct_name_ori1 (in=ina)
step1.WSCOPEUK (in=ina)
;
by name;
if ina and inb then output SA_STEP5.WSCOPEUK_TRY1;
else if ina then output sa_step6.Distinct_name_ori2;
else if inb then output sa_step6.WSCOPEUK_ori2;
run;
however, I get the following result.
858
859 data sa_step6.Distinct_name_ori1;
860 set Result.Distinct_name;
861 rename psn_name=name;
862 run;
NOTE: There were 881490 observations read from the data set RESULT.DISTINCT_NAME.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI1 has 881490 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 39.55 seconds
cpu time 2.35 seconds
863
864 proc sort data=step1.WSCOPEUK;
865 by name;
866 run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
867
868 proc sort data=sa_step6.Distinct_name_ori1;
869 by name;
870 run;
NOTE: There were 881490 observations read from the data set SA_STEP6.DISTINCT_NAME_ORI1.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI1 has 881490 observations and 2 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 21.02 seconds
cpu time 4.84 seconds
871
872 data SA_STEP5.WSCOPEUK_TRY1 sa_step6.Distinct_name_ori2 sa_step6.WSCOPEUK_ori2 ;
873 merge sa_step6.Distinct_name_ori1 (in=ina)
874 step1.WSCOPEUK (in=ina)
875 ;
876 by name;
877 if ina and inb then output SA_STEP5.WSCOPEUK_TRY1;
878 else if ina then output sa_step6.Distinct_name_ori2;
879 else if inb then output sa_step6.WSCOPEUK_ori2;
880 run;
NOTE: Variable inb is uninitialized.
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 881490 observations read from the data set SA_STEP6.DISTINCT_NAME_ORI1.
NOTE: There were 5664 observations read from the data set STEP1.WSCOPEUK.
NOTE: The data set SA_STEP5.WSCOPEUK_TRY1 has 0 observations and 15 variables.
NOTE: The data set SA_STEP6.DISTINCT_NAME_ORI2 has 887001 observations and 15 variables.
NOTE: The data set SA_STEP6.WSCOPEUK_ORI2 has 0 observations and 15 variables.
NOTE: DATA statement used (Total process time):
real time 35.96 seconds
cpu time 2.79 seconds
could you please give me some suggestion to fix this problem.
thanks in advanece
You used in=ina for both your data sets on the merge statement.
merge sa_step6.Distinct_name_ori1 (in=ina)
step1.WSCOPEUK (in=ina)
Change one of them to inb. I can't say at a glance which one you want to be inb.
That's why you have the note in the log that inb is uninitialized.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.