Hi all SAS Users,
From my understanding, we join or match two datasets mainly by using the proc SQL (inner, right, and left-join).
However, my case below does not work on that way.
I have two dataset called: winsorize and industry_return. The simplified versions of these two datasets are as below:
winsorize
Type INDC3 Year var1 var2
AXX CNSTM 1994 0.01 0.08
MAS CNSTM 1996 0.05 .
RAF CNSTM 1994 0.07 0.08
AGG CNSTM 2004 0.07 .
CCC CNSTM 1996 0.02 .
RBB ENEGY 1998 0.05 0.88
RFB ENEGY 1999 0.06 0.89
TYB ENEGY 2018 0.05 0.48
industry_return
INDC3 Year var3
CNSTM 1988 1.2
CNTSM 1989 1.3
CNTSM 1990 1.5
CNTSM 1994 1.1
CNTSM 1996 1.7
CNTSM 2004 1.9
CNTSM 2018 2.1
ENEGY 1988 3.3
ENEGY 1996 3.5
ENEGY 1998 3.9
ENEGY 1999 4.2
ENEGY 2018 4.4
I want a results called "matching"
Type INDC3 Year var1 var2 var3
AXX CNSTM 1994 0.01 0.08 1.1
MAS CNSTM 1996 0.05 . 1.7
RAF CNSTM 1994 0.07 0.08 1.1
AGG CNSTM 2004 0.07 . 1.9
CCC CNSTM 1996 0.02 . 1.7
RBB ENEGY 1998 0.05 0.88 3.9
RFB ENEGY 1999 0.06 0.89 4.2
TYB ENEGY 2018 0.05 0.48 4.4
And I see that this match seems not to be able to be done by using three join styles of proc sql.
Could you please suggest to me how to sort it out?
Warmest regards.
Hi @Phil_NZ,
Let me quote your initial post:
@Phil_NZ wrote:
I want a results called "matching"
Type INDC3 Year var1 var2 var3 AXX CNSTM 1994 0.01 0.08 1.1 MAS CNSTM 1996 0.05 . 1.7 RAF CNSTM 1994 0.07 0.08 1.1 AGG CNSTM 2004 0.07 . 1.9 CCC CNSTM 1996 0.02 . 1.7 RBB ENEGY 1998 0.05 0.88 3.9 RFB ENEGY 1999 0.06 0.89 4.2 TYB ENEGY 2018 0.05 0.48 4.4
When I used dataset winsorize instead of view _wins in the PROC SQL step (and no ORDER BY clause and no reference to variable _seqno) I obtained:
Type INDC3 Year var1 var2 var3 RAF CNSTM 1994 0.07 0.08 1.1 AXX CNSTM 1994 0.01 0.08 1.1 CCC CNSTM 1996 0.02 . 1.7 MAS CNSTM 1996 0.05 . 1.7 RBB ENEGY 1998 0.05 0.88 3.9 RFB ENEGY 1999 0.06 0.89 4.2 AGG CNSTM 2004 0.07 . 1.9 TYB ENEGY 2018 0.05 0.48 4.4
As you see, the order of observations is different from the result you wanted (which is typical for PROC SQL and SQL in general when you don't use an ORDER BY clause or certain other language elements), making it harder to check if the result meets your requirements. But if that's not a problem, i.e., "sort order [of the result] doesn't matter," then forget about _wins and _seqno. For the join PROC SQL doesn't need any specific sort order.
Note that table "matching" is not sorted by any obvious unique combination of available sort keys (e.g., INDC3, Type or INDC3, Year, var1 or descending anything). So, to obtain the exact table "matching" from PROC SQL there's almost no other option than to create a new sort key such as _seqno, based on dataset winsorize.
Please explain the logic that allows you to go from the input data set to the output data set.
Could you also please provide the data as DATA step code instead of columns of data?
Perhaps I'm missing something when skimming through your data, but this should work with a single join, using two columns in the join condition (INDC3 and Year)?
Hi @Phil_NZ,
I agree with @LinusH. Only if the sort order of matching must be the same as that of winsorize and the real data don't contain a suitable sort key, you may need to add a sequence number in a preliminary DATA step as shown below.
data winsorize;
input Type $ INDC3 $ Year var1 var2;
cards;
AXX CNSTM 1994 0.01 0.08
MAS CNSTM 1996 0.05 .
RAF CNSTM 1994 0.07 0.08
AGG CNSTM 2004 0.07 .
CCC CNSTM 1996 0.02 .
RBB ENEGY 1998 0.05 0.88
RFB ENEGY 1999 0.06 0.89
TYB ENEGY 2018 0.05 0.48
;
data industry_return;
input INDC3 $ Year var3;
cards;
CNSTM 1988 1.2
CNTSM 1989 1.3
CNTSM 1990 1.5
CNTSM 1994 1.1
CNTSM 1996 1.7
CNTSM 2004 1.9
CNTSM 2018 2.1
ENEGY 1988 3.3
ENEGY 1996 3.5
ENEGY 1998 3.9
ENEGY 1999 4.2
ENEGY 2018 4.4
;
data _wins / view=_wins;
set winsorize;
_seqno=_n_;
run;
proc sql;
create table matching(drop=_seqno) as
select a.*, var3
from _wins a left join industry_return b
on spedis(a.indc3, b.indc3)<=10 & a.year=b.year
order by _seqno;
quit;
(Increase the "tolerance" 10 in the ON condition if there are more severe typos (?) in INDC3 than just "CNSTM" vs. "CNTSM", but make sure that no actual mismatches are tolerated.)
Hi @FreelanceReinh , @Kurt_Bremser , @LinusH , and @PaigeMiller
Sorry it is totally my typos, so when I edit my data consistently, whether the changed line in the code below is correct, I just wanna cross-check:
Simplistically speaking, yes, I match based on INDC3 and Year.
data winsorize;
input Type $ INDC3 $ Year var1 var2;
cards;
AXX CNSTM 1994 0.01 0.08
MAS CNSTM 1996 0.05 .
RAF CNSTM 1994 0.07 0.08
AGG CNSTM 2004 0.07 .
CCC CNSTM 1996 0.02 .
RBB ENEGY 1998 0.05 0.88
RFB ENEGY 1999 0.06 0.89
TYB ENEGY 2018 0.05 0.48
;
data industry_return;
input INDC3 $ Year var3;
cards;
CNSTM 1988 1.2
CNSTM 1989 1.3
CNSTM 1990 1.5
CNSTM 1994 1.1
CNSTM 1996 1.7
CNSTM 2004 1.9
CNSTM 2018 2.1
ENEGY 1988 3.3
ENEGY 1996 3.5
ENEGY 1998 3.9
ENEGY 1999 4.2
ENEGY 2018 4.4
;
data _wins / view=_wins;
set winsorize;
_seqno=_n_;
run;
proc sql;
create table matching(drop=_seqno) as
select a.*, var3
from _wins a left join industry_return b
on a.indc3 = b.indc3 & a.year=b.year /*changed line*/
order by _seqno;
quit;
Apart from that, I do not know what is the role and purpose of these lines of code in my case???
data _wins / view=_wins;
and
_seqno=_n_;
Warm regards.
Yes, with consistent data, of course, you don't need functions (like SPEDIS) supporting "fuzzy" comparisons and the criterion regarding INDC3 simplifies to a.indc3=b.indc3. I introduced the sequence number _seqno just to ensure that datasets winsorize and matching have the same sort order. If sort order doesn't matter, you should omit the DATA step creating the view _wins and use dataset winsorize instead in the PROC SQL step (as this will improve performance and simplify the code):
proc sql;
create table matching as
select a.*, var3
from winsorize a left join industry_return b
on a.indc3=b.indc3 & a.year=b.year;
quit;
I used a view rather than a dataset to create the temporary sort key _seqno in order to save disk space (of about the size occupied by dataset winsorize).
Thank you very much for your explanation.
Can I ask, so "same sort of order" means similar in "ascending or descending" order , is not it?
I could not get the idea " sort order doesn't matter", because I think SAS will automatically match from Table b with b based on the match in "INDC3" and "year". And in PROC SQL, we do not need to care about the sort or else?
Please let me know if I fall into any fallacy.
Cheers.
P/S: I also have some "trials and errors"test, when applying your code with _seqno, the order of observations in table matching following that of winsorize. Without applying _seqno, the order of dataset matching similar to that of industry_return. Interesting to me.
Hi @Phil_NZ,
Let me quote your initial post:
@Phil_NZ wrote:
I want a results called "matching"
Type INDC3 Year var1 var2 var3 AXX CNSTM 1994 0.01 0.08 1.1 MAS CNSTM 1996 0.05 . 1.7 RAF CNSTM 1994 0.07 0.08 1.1 AGG CNSTM 2004 0.07 . 1.9 CCC CNSTM 1996 0.02 . 1.7 RBB ENEGY 1998 0.05 0.88 3.9 RFB ENEGY 1999 0.06 0.89 4.2 TYB ENEGY 2018 0.05 0.48 4.4
When I used dataset winsorize instead of view _wins in the PROC SQL step (and no ORDER BY clause and no reference to variable _seqno) I obtained:
Type INDC3 Year var1 var2 var3 RAF CNSTM 1994 0.07 0.08 1.1 AXX CNSTM 1994 0.01 0.08 1.1 CCC CNSTM 1996 0.02 . 1.7 MAS CNSTM 1996 0.05 . 1.7 RBB ENEGY 1998 0.05 0.88 3.9 RFB ENEGY 1999 0.06 0.89 4.2 AGG CNSTM 2004 0.07 . 1.9 TYB ENEGY 2018 0.05 0.48 4.4
As you see, the order of observations is different from the result you wanted (which is typical for PROC SQL and SQL in general when you don't use an ORDER BY clause or certain other language elements), making it harder to check if the result meets your requirements. But if that's not a problem, i.e., "sort order [of the result] doesn't matter," then forget about _wins and _seqno. For the join PROC SQL doesn't need any specific sort order.
Note that table "matching" is not sorted by any obvious unique combination of available sort keys (e.g., INDC3, Type or INDC3, Year, var1 or descending anything). So, to obtain the exact table "matching" from PROC SQL there's almost no other option than to create a new sort key such as _seqno, based on dataset winsorize.
You have typos in your industry_return dataset (CNTSM instead of CNSTM).
Otherwise, this is a classic example for doing a hash lookup:
data winsorize;
input Type $ INDC3 $ Year var1 var2;
datalines;
AXX CNSTM 1994 0.01 0.08
MAS CNSTM 1996 0.05 .
RAF CNSTM 1994 0.07 0.08
AGG CNSTM 2004 0.07 .
CCC CNSTM 1996 0.02 .
RBB ENEGY 1998 0.05 0.88
RFB ENEGY 1999 0.06 0.89
TYB ENEGY 2018 0.05 0.48
;
data industry_return;
input INDC3 $ Year var3;
datalines;
CNSTM 1988 1.2
CNSTM 1989 1.3
CNSTM 1990 1.5
CNSTM 1994 1.1
CNSTM 1996 1.7
CNSTM 2004 1.9
CNSTM 2018 2.1
ENEGY 1988 3.3
ENEGY 1996 3.5
ENEGY 1998 3.9
ENEGY 1999 4.2
ENEGY 2018 4.4
;
data want;
set winsorize;
if _n_ = 1 then do;
length var3 8;
declare hash i (dataset:"industry_return");
i.definekey("INDC3","year");
i.definedata("var3");
i.definedone();
end;
if i.find() ne 0 then var3 = .;
run;
proc print data=want noobs;
run;
Result:
Type INDC3 Year var1 var2 var3 AXX CNSTM 1994 0.01 0.08 1.1 MAS CNSTM 1996 0.05 . 1.7 RAF CNSTM 1994 0.07 0.08 1.1 AGG CNSTM 2004 0.07 . 1.9 CCC CNSTM 1996 0.02 . 1.7 RBB ENEGY 1998 0.05 0.88 3.9 RFB ENEGY 1999 0.06 0.89 4.2 TYB ENEGY 2018 0.05 0.48 4.4
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.