Dear all,
I want to join the following two datasets, I need to keep all records including non-matching ones.
My question to you is how could I create some identification variables, that will show whether a observation is coming from dataset one or dataset two.
thanks,
proc sql;
create table full (drop=cusip2 year2) as
select a.*, b.cusip as cusip2, b.year as year2, b.bb
from one as a full join two as b
on a.cusip=b.cusip and a.year=b.year;
quit;
Data one
Cusip Year aa
100 97 10
101 98 2
102 97 10
103 98 2
Data two
Cusip Year bb
101 98 34
102 97 2
103 98 3
104 99 5
Hi,
How about the following one...
proc sql;
create table both as
select coalesce(a.cusip,b.cusip) as cusip,
coalesce(a.year,b.year) as year,
aa,bb,
case
when bb = . then "Two"
when aa = . then "One"
else "Match"
end as source
from one as a
full join
two as b
on a.cusip = b.cusip;
quit;
-Urvish
You can add in some case statements that checks if the variables are empty to identify which is the data source.
Data one;
input Cusip Year aa;
cards;
100 97 10
101 98 2
102 97 10
103 98 2
;
Data two;
input Cusip Year bb;
cards;
101 98 34
102 97 2
103 98 3
104 99 5
;
proc sql;
create table full as
select a.*, b.cusip as cusip2, b.year as year2, b.bb
,case when b.cusip = a.cusip and a.year=b.year then 'Both'
when a.cusip=. then 'B'
else "A" end as source
from one as a
full outer join two as b
on a.cusip=b.cusip and a.year=b.year;
quit;
Thank you,
I used your code as below, and sas log produced an error, please help!
ERROR: Expression using equals (=) has components that are of different data types.
proc sql;
create table full as
select a.*, b.cusip as cusip2, b.year as year2, b.avgHORI, b.avgSPEC, b.avgcount
,case when b.cusip = a.cusip and a.year=b.year then 'Both'
when a.cusip=. then 'B'
else 'A' end as source
from private2 as a full outer join forecast as b
on a.cusip=b.cusip and a.year=b.year;
quit;
Sounds like CUSIP is a character variable in your data.
Instead of comparing it to numeric missing use the IS NULL syntax of SQL or the CMISS() function of SAS.
when a.cusip is null then 'B'
when cmiss(a.cusip) then 'B'
Try using NATURAL join.
Compare the results of these two queries.
proc sql ;
select * from one natural join two ;
select * from one natural full join two ;
quit;
You can add an extra variable by using sub query.
select *
from (select *,1 as in1 from one) a
natural full join
( select *,1 as in2 from two) b
;
How about:
Data one; input Cusip Year aa; cards; 100 97 10 101 98 2 102 97 10 103 98 2 ; Data two; input Cusip Year bb; cards; 101 98 34 102 97 2 103 98 3 104 99 5 ; proc sql; create table full as select a.*,bb,'Both' as source from one as a inner join two as b on a.cusip=b.cusip and a.year=b.year outer union corresponding select a.*,'A' as source from one as a where cats(a.cusip,a.year) not in ( select cats(cusip,year) from two ) outer union corresponding select b.*,'B' as source from two as b where cats(b.cusip,b.year) not in ( select cats(cusip,year) from one ) ; quit;
Ksharp
Hi,
How about the following one...
proc sql;
create table both as
select coalesce(a.cusip,b.cusip) as cusip,
coalesce(a.year,b.year) as year,
aa,bb,
case
when bb = . then "Two"
when aa = . then "One"
else "Match"
end as source
from one as a
full join
two as b
on a.cusip = b.cusip;
quit;
-Urvish
thanks to all of you who offered your help.
I like UrvishShah's answer the best , because it is easier for a beginner like me to use.
One more question,
my original post shows
data one has 3 variables: cusip year aa
data two has 3 variables: cusip year bb
but my actual data has
data one has many variables: cusip year aa1 aa2 aa3 .....
data two has many variables: cusip year bb1, bb2, .....
to Urvish or anyone who reads this,
do I just list on line 5 (see Urvish posting above).
aa1,aa2,aa3,....
or can I do
a.*, b.*,
I ask because I do not want to type so many variable names.
Lan
Two choices to make it easier to use * in select statement when joining tables.
1) Use the NATURAL keyword. SAS will identify the variables with the same names and use them all for joining. Common variables will automatically be coalesced.
create table want as
select * from one natural full join two
;
2) Use RENAME and DROP dataset options so that key variables from different datasets have different names.
create table want (drop=CUSIP1 YEAR1 CUSIP2 YEAR2) as
select coalesce(cusip1,cusip2) as cusip
, coalesce(year1,year2) as year
, *
from one (rename=(cusip=cusip1 year=year1))
full join
two (rename=(cusip=cusip2 year=year2))
on cusip1 = cusip2 and year1=year2
;
Note that if you want to be positive about source of each observations you cannot depend on testing for missing values of variables from the source datasets unless you are positive that those variables can never have missing values.
create table want as
select case when (in1=1 and in2=1) then 'BOTH'
when (in1=1) then 'ONE'
else 'TWO'
end as source
, *
from (select *,1 as in1 from one)
natural full join
(select *,1 as in2 from two)
;
;
Thanks a lot, Tom .
Use the %DO Iterative Loops within the macro to refrence many variables at one time...If you are not cleared about no of variables in your dataset then count the no of variables first of all and then save it in one macro variable...I have assumed that there are 20 variables in your dataset...
I have not tested it but should use...Hope it works...
%macro obs_indentify(no_of_vars =);
proc sql;
create table both as
select coalesce(a.cusip,b.cusip) as cusip,
coalesce(a.year,b.year) as year,
aa,bb,
case
%do i = 1 %to &no_of_vars.;
when bb&i. = . then "Two"
when aa&i. = . then "One"
%end; else "Match"
end as source
from one as a
full join
two as b
on a.cusip = b.cusip;
quit;
%mend;
%obs_identify(no_of_vars = 20);
-Urvish
Urvish,
In your macro, you specify 20 variables, does it count the common variables in both data one and data two.
that is,
e.g. data one has variables: cusip, year, aa, aa1, aa2. (5 vars)
e.g. data two has variables: cusip, year, bb, bb1, bb2. (5 vars)
in this case is no_of_vars =10, or 8 ?
2nd question, in your macro line 5, do we still need to list aa, bb, Or should we remove it,
thanks!
One problem, if aa or bb has some missing value , Urvish's code will not work any more.
Data one; input Cusip Year aa; cards; 100 97 10 101 98 2 102 97 . 103 98 2 ; Data two; input Cusip Year bb; cards; 101 98 34 102 97 . 103 98 3 104 99 5 ;
Ksharp
Hi Ksharp,
Thanks for correction...Here i have updated my above code...Hope it works in case of any kind of Data Structure...
Data one;
input Cusip Year aa;
cards;
100 97 10
101 98 2
102 97 .
103 98 2
;
Data two;
input Cusip Year bb;
cards;
101 98 .
102 97 .
103 98 3
104 99 5
;
proc sql;
create table both as
select coalesce(a.cusip,b.cusip) as cusip,
coalesce(a.year,b.year) as year,
aa,bb,
case
when a.cusip = b.cusip then "Match"
when aa = . and bb = . then "Match"
when bb = . then "Two"
when aa = . then "One"
else "Match"
end as source
from one as a
full join
two as b
on a.cusip = b.cusip;
quit;
-Urvish
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.