DATA Step, Macro, Functions and more

Comparing same variable from two datasets

Reply
Contributor
Posts: 72

Comparing same variable from two datasets

Hi SAS users,

 

I have two dataset which have one variable in common (ID). Now, I would like to simply compare this variable between these two datasets; and make sure how many observations of the common variable (ID) from the smaller dataset exist in the bigger dataset.

 

I used proc compare but I am not sure if its the best method. Please advise.

 

Thanks

Super User
Posts: 17,868

Re: Comparing same variable from two datasets

It works. It depends on what you want as output and what meets your requirement. 

 

So what are your requirements besides what's in both? What do you plan to do with this information afterwards.

 

Personally I find proc compare a bit useless except for quick comparisons and generally write my own code for comparisons.

Contributor
Posts: 72

Re: Comparing same variable from two datasets

[ Edited ]

what I want is just a simple comparison based on how many variable ID is matched in both datasets. No specific requirement.

 

Thanks

Contributor
Posts: 72

Re: Comparing same variable from two datasets

[ Edited ]

DATA1:

 

ID  A  B

8    .   .

9    .   .

 

 

DATA2:

 

ID C  D

1   5  6

2   7  8

3   9  3

4   6  6

 

proc compare base = data1 compare=data2;

var ID;

with ID;

run;

 

 

Why after running the program, in the observation summary it says:

Number of Observations in Common: 2  ?!!

 

Although, we don't have any common observation in varibale ID. Right?

 

 

Thank you

 

 

 

Trusted Advisor
Posts: 1,115

Re: Comparing same variable from two datasets

Hi @almmotamedi,

 

If you want to use PROC COMPARE for this purpose (and not a data step with MERGE statement or PROC SQL), you should use the ID statement instead of the WITH statement.

Contributor
Posts: 72

Re: Comparing same variable from two datasets

Thank you, but how using ID statement gives me observation summary? (I tried and it only gives variable summary!). Please help

Super User
Posts: 17,868

Re: Comparing same variable from two datasets

Not quite right, but enough to get you started.

 

DATA data1;
 
input ID  A  B;
cards;
8    .   .
9    .   .
;
 
data DATA2;
 
input ID C  D;
cards;
1   5  6
2   7  8
3   9  3
4   6  6
;

proc sql;
create table want as
select coalesce(a.id,b.id) as ID, a.ID as ID1, b.ID as ID2,
	case when a.ID=. then 'Data1'
	     when b.ID=. then 'Data2'
		 else 'CHECKME' end as source
from data1 as a
full join data2 as b
on a.id=b.id;
quit;

proc print data=want;
run;
Contributor
Posts: 72

Re: Comparing same variable from two datasets

[ Edited ]

Thank you so much, your answered worked for the sample datasets. But when I applied the code for the real datasets, I received the error below:

 

ERROR: Expression using equals (=) has components that are of different data types.

 

for on a.id=b.id;

 

I had checked the data types and both variables are character with the same length !

 

Could you please advise?

Super User
Posts: 10,516

Re: Comparing same variable from two datasets

Show us the proc contents results for the two sets that are involved in the error.

Contributor
Posts: 72

Re: Comparing same variable from two datasets

I checked the results and everything is the same except in one dataset the variable has "format" and "informat", but the other does not. 

 

1.jpg

 

2.jpg

 

Trusted Advisor
Posts: 1,115

Re: Comparing same variable from two datasets

The error message from PROC SQL does not refer to the ON clause, but to the WHEN conditions where character ID's would not match the type of numeric missing values. Just replace the latter with character missings ' ' or use neutral conditions like

when a.ID is null

or

when missing(a.ID)
Trusted Advisor
Posts: 1,115

Re: Comparing same variable from two datasets


almmotamedi wrote:

Thank you, but how using ID statement gives me observation summary? (I tried and it only gives variable summary!). Please help


This would be the case if you omitted the VAR statement.

 

My suggestion, however, was to "use the ID statement instead of the WITH statement."

proc compare base=data1 compare=data2;
var ID;
id ID;
run;

(But my first choice for this task would not be PROC COMPARE.)

 

Contributor
Posts: 72

Re: Comparing same variable from two datasets

So, could you please advise how you would do it, this task seems to be very easy but I am still struggling to find an answer!

Trusted Advisor
Posts: 1,115

Re: Comparing same variable from two datasets

[ Edited ]

My approach would depend on the characteristics of the two datasets and other circumstances:

 

  1. Are the datasets sorted (or indexed) by ID?
  2. Are there (possibly) duplicate ID values in one or in both datasets?
  3. If so, how should these be counted if they match?
  4. Are there missing values of ID?
  5. How large are the datasets?
  6. Do they have other variables in common except ID?
  7. Is it a quick-and-dirty count for your own purposes or part of a production-level program?
  8. Shall the number of matching ID values be written to the log, the output window, a dataset, a macro variable, ...?

 

[Edit: inserted missing word in item 7]

Super User
Posts: 9,687

Re: Comparing same variable from two datasets

Does SQL do this job ?
proc sql;
select id from small
intersect
select id from big;
quit;
Ask a Question
Discussion stats
  • 14 replies
  • 533 views
  • 4 likes
  • 5 in conversation