BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Phil_NZ
Barite | Level 11

Hi all SAS Users,

 

From my understanding, we join or match two datasets mainly by using the proc SQL (inner, right, and left-join).

However, my case below does not work on that way.

I have two dataset called: winsorize and industry_return. The simplified versions of these two datasets are as below:

winsorize

Type    INDC3       Year   var1    var2
AXX     CNSTM       1994   0.01    0.08
MAS     CNSTM       1996   0.05    .
RAF     CNSTM       1994   0.07    0.08
AGG     CNSTM       2004   0.07    .
CCC     CNSTM       1996   0.02    .
RBB     ENEGY       1998   0.05    0.88
RFB     ENEGY       1999   0.06    0.89
TYB     ENEGY       2018   0.05    0.48

 

industry_return

INDC3     Year    var3
CNSTM     1988    1.2
CNTSM     1989    1.3
CNTSM     1990    1.5
CNTSM     1994    1.1
CNTSM     1996    1.7
CNTSM     2004    1.9
CNTSM     2018    2.1
ENEGY     1988    3.3
ENEGY     1996    3.5
ENEGY     1998    3.9
ENEGY     1999    4.2
ENEGY     2018    4.4

 

I want a results called "matching"

 

Type    INDC3          Year   var1    var2       var3
AXX     CNSTM          1994   0.01    0.08        1.1
MAS     CNSTM          1996   0.05    .           1.7
RAF     CNSTM          1994   0.07    0.08        1.1
AGG     CNSTM          2004   0.07    .           1.9
CCC     CNSTM          1996   0.02    .           1.7
RBB     ENEGY          1998   0.05    0.88        3.9
RFB     ENEGY          1999   0.06    0.89        4.2
TYB     ENEGY          2018   0.05    0.48        4.4

And I see that this match seems not to be able to be done by using three join styles of proc sql.

 

Could you please suggest to me how to sort it out?

 

Warmest regards.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hi @Phil_NZ,

 

Let me quote your initial post:


@Phil_NZ wrote:

I want a results called "matching"

 

Type    INDC3          Year   var1    var2       var3
AXX     CNSTM          1994   0.01    0.08        1.1
MAS     CNSTM          1996   0.05    .           1.7
RAF     CNSTM          1994   0.07    0.08        1.1
AGG     CNSTM          2004   0.07    .           1.9
CCC     CNSTM          1996   0.02    .           1.7
RBB     ENEGY          1998   0.05    0.88        3.9
RFB     ENEGY          1999   0.06    0.89        4.2
TYB     ENEGY          2018   0.05    0.48        4.4

When I used dataset winsorize instead of view _wins in the PROC SQL step (and no ORDER BY clause and no reference to variable _seqno) I obtained:

Type    INDC3    Year    var1    var2    var3

RAF     CNSTM    1994    0.07    0.08     1.1
AXX     CNSTM    1994    0.01    0.08     1.1
CCC     CNSTM    1996    0.02     .       1.7
MAS     CNSTM    1996    0.05     .       1.7
RBB     ENEGY    1998    0.05    0.88     3.9
RFB     ENEGY    1999    0.06    0.89     4.2
AGG     CNSTM    2004    0.07     .       1.9
TYB     ENEGY    2018    0.05    0.48     4.4

As you see, the order of observations is different from the result you wanted (which is typical for PROC SQL and SQL in general when you don't use an ORDER BY clause or certain other language elements), making it harder to check if the result meets your requirements. But if that's not a problem, i.e., "sort order [of the result] doesn't matter," then forget about _wins and _seqno. For the join PROC SQL doesn't need any specific sort order.

 

Note that table "matching" is not sorted by any obvious unique combination of available sort keys (e.g., INDC3, Type or INDC3, Year, var1 or descending anything). So, to obtain the exact table "matching" from PROC SQL there's almost no other option than to create a new sort key such as _seqno, based on dataset winsorize.

View solution in original post

8 REPLIES 8
PaigeMiller
Diamond | Level 26

Please explain the logic that allows you to go from the input data set to the output data set.

 

Could you also please provide the data as DATA step code instead of columns of data?

--
Paige Miller
LinusH
Tourmaline | Level 20

Perhaps I'm missing something when skimming through your data, but this should work with a single join, using two columns in the join condition (INDC3 and Year)?

Data never sleeps
FreelanceReinh
Jade | Level 19

Hi @Phil_NZ,

 

I agree with @LinusH. Only if the sort order of matching must be the same as that of winsorize and the real data don't contain a suitable sort key, you may need to add a sequence number in a preliminary DATA step as shown below.

data winsorize;
input Type $ INDC3 $ Year var1 var2;
cards;
AXX CNSTM 1994 0.01 0.08
MAS CNSTM 1996 0.05 .
RAF CNSTM 1994 0.07 0.08
AGG CNSTM 2004 0.07 .
CCC CNSTM 1996 0.02 .
RBB ENEGY 1998 0.05 0.88
RFB ENEGY 1999 0.06 0.89
TYB ENEGY 2018 0.05 0.48
;

data industry_return;
input INDC3 $ Year var3;
cards;
CNSTM 1988 1.2
CNTSM 1989 1.3
CNTSM 1990 1.5
CNTSM 1994 1.1
CNTSM 1996 1.7
CNTSM 2004 1.9
CNTSM 2018 2.1
ENEGY 1988 3.3
ENEGY 1996 3.5
ENEGY 1998 3.9
ENEGY 1999 4.2
ENEGY 2018 4.4
;

data _wins / view=_wins;
set winsorize;
_seqno=_n_;
run;

proc sql;
create table matching(drop=_seqno) as
select a.*, var3
from _wins a left join industry_return b
on spedis(a.indc3, b.indc3)<=10 & a.year=b.year
order by _seqno;
quit;

(Increase the "tolerance" 10 in the ON condition if there are more severe typos (?) in INDC3 than just "CNSTM" vs. "CNTSM", but make sure that no actual mismatches are tolerated.)

Phil_NZ
Barite | Level 11

Hi @FreelanceReinh , @Kurt_Bremser , @LinusH , and @PaigeMiller 

 

Sorry it is totally my typos, so when I edit my data consistently, whether the changed line in the code below is correct, I just wanna cross-check:

Simplistically speaking, yes, I match based on INDC3 and Year.

data winsorize;
input Type $ INDC3 $ Year var1 var2;
cards;
AXX CNSTM 1994 0.01 0.08
MAS CNSTM 1996 0.05 .
RAF CNSTM 1994 0.07 0.08
AGG CNSTM 2004 0.07 .
CCC CNSTM 1996 0.02 .
RBB ENEGY 1998 0.05 0.88
RFB ENEGY 1999 0.06 0.89
TYB ENEGY 2018 0.05 0.48
;

data industry_return;
input INDC3 $ Year var3;
cards;
CNSTM 1988 1.2
CNSTM 1989 1.3
CNSTM 1990 1.5
CNSTM 1994 1.1
CNSTM 1996 1.7
CNSTM 2004 1.9
CNSTM 2018 2.1
ENEGY 1988 3.3
ENEGY 1996 3.5
ENEGY 1998 3.9
ENEGY 1999 4.2
ENEGY 2018 4.4
;

data _wins / view=_wins;
set winsorize;
_seqno=_n_;
run;

proc sql;
create table matching(drop=_seqno) as
select a.*, var3
from _wins a left join industry_return b
on a.indc3 = b.indc3 & a.year=b.year /*changed line*/
order by _seqno;
quit;

Apart from that, I do not know what is the role and purpose of these lines of code in my case???

data _wins / view=_wins;

and 

_seqno=_n_;

Warm regards. 

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
FreelanceReinh
Jade | Level 19

Yes, with consistent data, of course, you don't need functions (like SPEDIS) supporting "fuzzy" comparisons and the criterion regarding INDC3 simplifies to a.indc3=b.indc3. I introduced the sequence number _seqno just to ensure that datasets winsorize and matching have the same sort order. If sort order doesn't matter, you should omit the DATA step creating the view _wins and use dataset winsorize instead in the PROC SQL step (as this will improve performance and simplify the code):

proc sql;
create table matching as
select a.*, var3
from winsorize a left join industry_return b
on a.indc3=b.indc3 & a.year=b.year;
quit;

 

I used a view rather than a dataset to create the temporary sort key _seqno in order to save disk space (of about the size occupied by dataset winsorize).

Phil_NZ
Barite | Level 11

Hi @FreelanceReinh 

 

Thank you very much for your explanation.

Can I ask, so "same sort of order" means similar in "ascending or descending" order , is not it?

I could not get the idea " sort order doesn't matter", because I think SAS will automatically match from Table b with b based on the match in "INDC3" and "year". And in PROC SQL, we do not need to care about the sort or else?

Please let me know if I fall into any fallacy.

 

Cheers.

P/S: I also have some "trials and errors"test, when applying your code with _seqno, the order of observations in table matching following that of winsorize. Without applying _seqno, the order of dataset matching similar to that of industry_return. Interesting to me.

Thank you for your help, have a fabulous and productive day! I am a novice today, but someday when I accumulate enough knowledge, I can help others in my capacity.
FreelanceReinh
Jade | Level 19

Hi @Phil_NZ,

 

Let me quote your initial post:


@Phil_NZ wrote:

I want a results called "matching"

 

Type    INDC3          Year   var1    var2       var3
AXX     CNSTM          1994   0.01    0.08        1.1
MAS     CNSTM          1996   0.05    .           1.7
RAF     CNSTM          1994   0.07    0.08        1.1
AGG     CNSTM          2004   0.07    .           1.9
CCC     CNSTM          1996   0.02    .           1.7
RBB     ENEGY          1998   0.05    0.88        3.9
RFB     ENEGY          1999   0.06    0.89        4.2
TYB     ENEGY          2018   0.05    0.48        4.4

When I used dataset winsorize instead of view _wins in the PROC SQL step (and no ORDER BY clause and no reference to variable _seqno) I obtained:

Type    INDC3    Year    var1    var2    var3

RAF     CNSTM    1994    0.07    0.08     1.1
AXX     CNSTM    1994    0.01    0.08     1.1
CCC     CNSTM    1996    0.02     .       1.7
MAS     CNSTM    1996    0.05     .       1.7
RBB     ENEGY    1998    0.05    0.88     3.9
RFB     ENEGY    1999    0.06    0.89     4.2
AGG     CNSTM    2004    0.07     .       1.9
TYB     ENEGY    2018    0.05    0.48     4.4

As you see, the order of observations is different from the result you wanted (which is typical for PROC SQL and SQL in general when you don't use an ORDER BY clause or certain other language elements), making it harder to check if the result meets your requirements. But if that's not a problem, i.e., "sort order [of the result] doesn't matter," then forget about _wins and _seqno. For the join PROC SQL doesn't need any specific sort order.

 

Note that table "matching" is not sorted by any obvious unique combination of available sort keys (e.g., INDC3, Type or INDC3, Year, var1 or descending anything). So, to obtain the exact table "matching" from PROC SQL there's almost no other option than to create a new sort key such as _seqno, based on dataset winsorize.

Kurt_Bremser
Super User

You have typos in your industry_return dataset (CNTSM instead of CNSTM).

 

Otherwise, this is a classic example for doing a hash lookup:

data winsorize;
input Type $ INDC3 $ Year var1 var2;
datalines;
AXX     CNSTM       1994   0.01    0.08
MAS     CNSTM       1996   0.05    .
RAF     CNSTM       1994   0.07    0.08
AGG     CNSTM       2004   0.07    .
CCC     CNSTM       1996   0.02    .
RBB     ENEGY       1998   0.05    0.88
RFB     ENEGY       1999   0.06    0.89
TYB     ENEGY       2018   0.05    0.48
;

data industry_return;
input INDC3 $  Year var3;
datalines;
CNSTM     1988    1.2
CNSTM     1989    1.3
CNSTM     1990    1.5
CNSTM     1994    1.1
CNSTM     1996    1.7
CNSTM     2004    1.9
CNSTM     2018    2.1
ENEGY     1988    3.3
ENEGY     1996    3.5
ENEGY     1998    3.9
ENEGY     1999    4.2
ENEGY     2018    4.4
;

data want;
set winsorize;
if _n_ = 1 then do;
  length var3 8;
  declare hash i (dataset:"industry_return");
  i.definekey("INDC3","year");
  i.definedata("var3");
  i.definedone();
end;
if i.find() ne 0 then var3 = .;
run;

proc print data=want noobs;
run;

Result:

Type	INDC3	Year	var1	var2	var3
AXX	CNSTM	1994	0.01	0.08	1.1
MAS	CNSTM	1996	0.05	.	1.7
RAF	CNSTM	1994	0.07	0.08	1.1
AGG	CNSTM	2004	0.07	.	1.9
CCC	CNSTM	1996	0.02	.	1.7
RBB	ENEGY	1998	0.05	0.88	3.9
RFB	ENEGY	1999	0.06	0.89	4.2
TYB	ENEGY	2018	0.05	0.48	4.4

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 843 views
  • 6 likes
  • 5 in conversation