Re: one to many merge does not work as expected

somebody · Posted 06-25-2020 01:38 AM

I want to merging all obs in the dataset TEST_STATS to each obs in the dataset TEST_NOMATCH if they match by the NAME_CLEANED variable. I have attached the sample datasets so they only have obs with 1 value for the NAME_CLEANED var to test. My real dataset has many values for this NAME_CLEANED variable. If TEST_STATS has 4 obs and TEST_NOMATCH has 40 obs, what I want to have is a new dataset with160(40x4) obs.

Originally, I did this using a DATA step with MERGE. However, I found out that they dont really do what I want.

data test_merged; merge test_nomatch test_stats; by name_cleaned; run;

The output has only 40 obs, where the values change for the first 4 obs and then remain the same for the last 36 obs.

I can achieve what I want by using PROC SQL as followed:

proc sql;
create table test_merged_sql as
select a.*, b.*
from test_nomatch as a, test_stats as b
where a.name_cleaned= b.name_cleaned;
run;quit;

My question: how can I achieve what I want using a DATA step? and What went wrong with my original DATA step?

somebody · Posted 06-25-2020 01:48 AM

The reason I would want to use a DATA step is that I can separate between the obs that matches and the ones that do not match. How can I do this with PROC SQL?

andreas_lds · Posted 06-25-2020 01:52 AM

Please post example data that matches your real-data as close as possible. And post data as data-step-code, that makes it easier to use.

somebody · Posted 06-25-2020 02:20 AM

is there a way to export the dataset as a data-step code? or do I have to do this manually?

sustagens · Posted 06-25-2020 02:28 AM

How to convert datasets to data steps

Shmuel · Posted 06-25-2020 02:10 AM

SAS data sets have different internal format depending OS and SAS version,

thus attaching a sas data-set is not desirable.

somebody · Posted 06-25-2020 02:31 AM

Here are the datasets:

data test_stats;
   input constituent_ticker: $5. secid: $20. name_cleaned: $100.;
datalines;

AA	0P00018WO0	ALCOA
AA	0P0000008J	ALCOA
AA	0P0000A57O	ALCOA
AAI	0P00006XU1	ALCOA

run;


data test_nomatch;
   input fund_ticker: $5. name: $100. name_cleaned: $100.;
datalines;
ACWI	ALCOA	ALCOA
DGT	ALCOA	ALCOA
EUSA	ALCOA	ALCOA
GRES	ALCOA	ALCOA
HAP	ALCOA	ALCOA
IGE	ALCOA	ALCOA
ITOT	ALCOA	ALCOA
IUSV	ALCOA	ALCOA
IVE	ALCOA	ALCOA
IVV	ALCOA	ALCOA
IWB	ALCOA	ALCOA
IWD	ALCOA	ALCOA
IWR	ALCOA	ALCOA
IWS	ALCOA	ALCOA
IWV	ALCOA	ALCOA
IYM	ALCOA	ALCOA
IYY	ALCOA	ALCOA
JHML	ALCOA	ALCOA
JHMM	ALCOA	ALCOA
JKI	ALCOA	ALCOA
MXI	ALCOA	ALCOA
PBP	ALCOA	ALCOA
PHDG	ALCOA	ALCOA
PICK	ALCOA	ALCOA
PRF	ALCOA	ALCOA
PYZ	ALCOA	ALCOA
RWL	ALCOA	ALCOA
SIZE	ALCOA	ALCOA
SPHB	ALCOA	ALCOA
SPLG	ALCOA	ALCOA
SPTM	ALCOA	ALCOA
SPY	ALCOA	ALCOA
SPYV	ALCOA	ALCOA
SPYX	ALCOA	ALCOA
TOK	ALCOA	ALCOA
URTH	ALCOA	ALCOA
VLUE	ALCOA	ALCOA
XLB	ALCOA	ALCOA
XME	ALCOA	ALCOA
XMVM	ALCOA	ALCOA
run;

yabwon · Posted 06-25-2020 02:35 AM

Hi,

It seems that this is what you may need:

data Test_nomatch;
input (constituent_ticker SecID name_cleaned) ($);
cards;
abc 1 A
def 2 B
ghi 3 C
abc 4 A
def 5 B
ghi 6 C
123 1 D
456 2 E
678 3 F
;
run;

data Test_stats;
input (fund_ticker Name name_cleaned) ($);
cards;
x a A
y b B
y c C
t a A
s b B
v c C
;
run;



data matched notmatched(drop=fund_ticker Name);
  if 0 then set Test_nomatch Test_stats;
  declare hash TS(dataset:"Test_stats", multidata:"Y");
  TS.defineKey("name_cleaned");
  TS.defineData("fund_ticker","Name");
  TS.defineDone();

  do until(EOF);
    set Test_nomatch end=EOF;

    if TS.check() = 0 then
    do;
      TS.reset_dup();
      do while(TS.do_over() = 0);
        output matched;
      end;
    end;
    else
      /*call missing(fund_ticker,Name);*/
      output notmatched;
  end;

stop;
run;

All the best;

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Kurt_Bremser · Posted 06-25-2020 04:16 AM

Your notmatched dataset will have incorrect non-missing values for fund_ticker and name; you should set those to missing before the OUTPUT, or drop the variables with a dataset option in the first place.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yabwon · Posted 06-25-2020 04:31 AM

@Kurt_Bremser thanks a million! I've updated the code.

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Kurt_Bremser · Posted 06-25-2020 04:36 AM

"Given enough eyeballs, all bugs are shallow." (Linus's law, formulated by Eric S. Raymond)

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

yabwon · Posted 06-25-2020 04:43 AM

I'm calling it "programming on 4 hands" and I'm practicing it as often as I can during production programing. 😉

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Classroom Training Available!