BookmarkSubscribeRSS Feed
GN0001
Barite | Level 11

Hello all,

I have 64 rows and I need to find them in a bigger data set. Here, my inquiry is about the logic that I have applied.

I did a left outer join with missing on the second dataset and I was able to find those rows that exist in the first/left dataset, but it doesn't exist in the second/right dataset. The result came 4 rows.

 

Then I did inner join between first and second dataset with Distinct values and came 54 rows.

If I do without distinct, the result is 169 rows. 

 

54+4 equal to 58 and deduct it from 64, we have 6 rows left, which I don't know what is going about them.

Can we say since we put distinct, we removed duplicates, in fact those 6 rows are among those 54 rows, they are just taken away by Distinct because they are duplicates.

Here, I am looking for logic. I need to know the way I elaborate it is correct. I need to present it to my boss.

Regards,

Blue&Blue

Blue Blue
5 REPLIES 5
Tom
Super User Tom
Super User

How many of the 64 are distinct?

What do you mean by distinct?  Normally in SQL that means unique observations.  Two observations could have the same value of the key variable but still be unique because they have different values of other variables.

GN0001
Barite | Level 11
Hello,
Thanks for the reply,
Then if the join values are same but they have different values in other variables, that means the observations are unique. 54 are unique out of 64.
Regards,
Blue & Blue
Blue Blue
Tom
Super User Tom
Super User

If you want control use data step instead of SQL.

For example you might MERGE the two dataset by the key variable and use the IN= dataset option to be able to track which dataset contributed to this observations.

 

So if you have two datasets named ONE and TWO and the key variable is named ID then the code might look like this:

data want;
  merge one(in=in1) two(in=in2);
  by id;
  if in1 and in2 then source='BOTH';
  else if in1 then source='ONE ';
  else source='TWO';
  unique=first.id;
run;

proc summary data=want;
  class source;
  var unique;
  output out=summary N=COUNT sum(unique)=UNIQUE ;
run;

proc print data=summary;
   var source count unique;
run;
GN0001
Barite | Level 11

Hello Tom,

Thanks for the code.

Can you explain about class and source?

proc summary data=want;
  class source;
  var unique;
  output out=summary N=COUNT sum(unique)=UNIQUE ;
run;

Regards,

blue&blue

Blue Blue
Tom
Super User Tom
Super User

SOURCE is the variable name.

CLASS is a common SAS statement that is supported in various ways by a lot of SAS procedure.  It basically is for telling the procedure to group the data by the levels of that variable.

For more information read the documentation on PROC SUMMARY (also known as PROC MEANS).

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 728 views
  • 3 likes
  • 2 in conversation