I want to merging all obs in the dataset TEST_STATS to each obs in the dataset TEST_NOMATCH if they match by the NAME_CLEANED variable. I have attached the sample datasets so they only have obs with 1 value for the NAME_CLEANED var to test. My real dataset has many values for this NAME_CLEANED variable. If TEST_STATS has 4 obs and TEST_NOMATCH has 40 obs, what I want to have is a new dataset with160(40x4) obs.
Originally, I did this using a DATA step with MERGE. However, I found out that they dont really do what I want.
data test_merged; merge test_nomatch test_stats; by name_cleaned; run;
The output has only 40 obs, where the values change for the first 4 obs and then remain the same for the last 36 obs.
I can achieve what I want by using PROC SQL as followed:
proc sql;
create table test_merged_sql as
select a.*, b.*
from test_nomatch as a, test_stats as b
where a.name_cleaned= b.name_cleaned;
run;quit;
My question: how can I achieve what I want using a DATA step? and What went wrong with my original DATA step?
The reason I would want to use a DATA step is that I can separate between the obs that matches and the ones that do not match. How can I do this with PROC SQL?
Please post example data that matches your real-data as close as possible. And post data as data-step-code, that makes it easier to use.
is there a way to export the dataset as a data-step code? or do I have to do this manually?
SAS data sets have different internal format depending OS and SAS version,
thus attaching a sas data-set is not desirable.
Here are the datasets:
data test_stats;
input constituent_ticker: $5. secid: $20. name_cleaned: $100.;
datalines;
AA 0P00018WO0 ALCOA
AA 0P0000008J ALCOA
AA 0P0000A57O ALCOA
AAI 0P00006XU1 ALCOA
run;
data test_nomatch;
input fund_ticker: $5. name: $100. name_cleaned: $100.;
datalines;
ACWI ALCOA ALCOA
DGT ALCOA ALCOA
EUSA ALCOA ALCOA
GRES ALCOA ALCOA
HAP ALCOA ALCOA
IGE ALCOA ALCOA
ITOT ALCOA ALCOA
IUSV ALCOA ALCOA
IVE ALCOA ALCOA
IVV ALCOA ALCOA
IWB ALCOA ALCOA
IWD ALCOA ALCOA
IWR ALCOA ALCOA
IWS ALCOA ALCOA
IWV ALCOA ALCOA
IYM ALCOA ALCOA
IYY ALCOA ALCOA
JHML ALCOA ALCOA
JHMM ALCOA ALCOA
JKI ALCOA ALCOA
MXI ALCOA ALCOA
PBP ALCOA ALCOA
PHDG ALCOA ALCOA
PICK ALCOA ALCOA
PRF ALCOA ALCOA
PYZ ALCOA ALCOA
RWL ALCOA ALCOA
SIZE ALCOA ALCOA
SPHB ALCOA ALCOA
SPLG ALCOA ALCOA
SPTM ALCOA ALCOA
SPY ALCOA ALCOA
SPYV ALCOA ALCOA
SPYX ALCOA ALCOA
TOK ALCOA ALCOA
URTH ALCOA ALCOA
VLUE ALCOA ALCOA
XLB ALCOA ALCOA
XME ALCOA ALCOA
XMVM ALCOA ALCOA
run;
Hi,
It seems that this is what you may need:
data Test_nomatch;
input (constituent_ticker SecID name_cleaned) ($);
cards;
abc 1 A
def 2 B
ghi 3 C
abc 4 A
def 5 B
ghi 6 C
123 1 D
456 2 E
678 3 F
;
run;
data Test_stats;
input (fund_ticker Name name_cleaned) ($);
cards;
x a A
y b B
y c C
t a A
s b B
v c C
;
run;
data matched notmatched(drop=fund_ticker Name);
if 0 then set Test_nomatch Test_stats;
declare hash TS(dataset:"Test_stats", multidata:"Y");
TS.defineKey("name_cleaned");
TS.defineData("fund_ticker","Name");
TS.defineDone();
do until(EOF);
set Test_nomatch end=EOF;
if TS.check() = 0 then
do;
TS.reset_dup();
do while(TS.do_over() = 0);
output matched;
end;
end;
else
/*call missing(fund_ticker,Name);*/
output notmatched;
end;
stop;
run;
All the best;
Bart
Your notmatched dataset will have incorrect non-missing values for fund_ticker and name; you should set those to missing before the OUTPUT, or drop the variables with a dataset option in the first place.
@Kurt_Bremser thanks a million! I've updated the code.
"Given enough eyeballs, all bugs are shallow." (Linus's law, formulated by Eric S. Raymond)
I'm calling it "programming on 4 hands" and I'm practicing it as often as I can during production programing. 😉
Bart
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.