SAS Programming

HeatherNewton · Posted 10-24-2022 09:09 PM

Hi am i right to say this, when a sas merge is many to one or one to many, oracle sql full join works. If many to many, then no normal sql can do that. I will have to create own sql command. I am thinking if the best way is use full outer join then set conditions to remove excess observations.

Any insights.. my friend suggests oracle merge into might do the trick, any tried?

Tom · Posted 10-24-2022 09:30 PM

provide a description or link to "merge into" from oracle.

Provide more details on the type of merge you are trying to replicate.

Are there any non-key common variables between the two datasets?

Do you already have an counter variable in each dataset that can be used to insure the records are match in the proper order?

Also how much does it matter whether or not the SAS merge is replicated 100%? Perhaps there are aspects of the result that you are not using that mean you don't actually need to replicate those things.

SASKiwi · Posted 10-24-2022 09:42 PM

@HeatherNewton wrote:
Hi am i right to say this, when a sas merge is many to one or one to many, oracle sql full join works. If many to many, then no normal sql can do that. I will have to create own sql command. I am thinking if the best way is use full outer join then set conditions to remove excess observations.

Any insights.. my friend suggests oracle merge into might do the trick, any tried?

You keep searching for an SQL solution that will work exactly the same as a SAS MERGE - believe me there isn't any. MERGE does coalescing of columns with the same name from each table without explicit coding, in SQL you have to code it.

Also "normal" SQL can do many-to-many joins without a problem. SAS MERGE can't unless you get complicated and use other statements.

The sooner you stop thinking of SQL in terms of DATA step processing, the sooner you will start to make progress.

Kurt_Bremser · Posted 10-25-2022 04:25 AM

You keep asking the same thing over and over again, although you have gotten many answers from all the senior folks here. This is somehow disturbing.

Take a few (think: thousands!) steps back and look at the big picture:

what is the data I get initially?
what is the information needed from this data?

Once this process or these processes are sufficiently defined, you can go to work to implement them in Oracle.

Any undocumented process in SAS, where nobody can tell you what it does (from the business/analytics side, without inspecting the SAS code) is most probably not worth bothering about anyway.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

HeatherNewton · Posted 10-26-2022 07:26 AM

Sorry guys! I was hoping writing out sas match merge logic is easier than understanding what program is actually doing to the data. Also want to have same output for Comparison with production data.

In my project, i have lots and lots of sas programs with match merge which i cant tell why merge is used. Eg in a credit card project, on a daily basis, we merge the card application info dataset (which is accumulated daily) with new daily application data.

Eg data application;
merge application dataA dataB;
by application_ref_no;
run;

Basically from physical data, they are many to many. What i dont understand is why merge is applicable as only last entry of dataA and dataB would be used for the same application_ref_no..

So i cant tell if using other join could work and if i used full join i will have much bigger output.

Tom · Posted 10-26-2022 08:48 AM

You have THREE datasets that each have multiple observations for the same APPLICATION number?

What would that even mean?

What do you mean by LAST one?

SQL does not have a concept of LAST. If you have an ordering variable you can mimic it by selecting the MAX.

Is there some other variable (like a date?) that defines the order?

If you only want the last one then subset each input dateset to the last observation per APPLICATION and then it is a simple one-to-one (to-one) join.

HeatherNewton · Posted 10-26-2022 09:43 AM

I dont mean I want the last one only but for sas match merge if I have eg

data A data B

1 a 1 d

1 b 1 e

1 c

result would be

1 a d

1 b e

1 c e

I mean the one extra on data A would merge to the last one in data B which is 1 e

this is a very strange kind of merge and when is this ever be useful

Kurt_Bremser · Posted 10-26-2022 10:17 AM

This is what a data step MERGE does. Keep in mind that the purest form of a MERGE has no BY statement and just sets two datasets side by side. This is also why the MERGE is called that and not JOIN.

The inner simplicity of the data step is also the reason why it performs so well, but the SAS coder needs to be aware of every aspect of its working.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Tom · Posted 10-26-2022 10:37 AM

That is just how MERGE works. The reason E is "carried forward" is because there are no more observations to read that would change its value. Remember this feature is WHY the one to many merge works. The variables contributed by the one observation dataset are never changed until the next BY group is reached.

But the question was WHY would YOU do that with bank application data? What is the code trying to do?

Perhaps the code did not expect there to be multiple observations for the same by group from more than one input dataset. Perhaps you are trying to use the code on the wrong type of data.

Reeza · Posted 10-26-2022 03:51 PM

@HeatherNewton wrote:

this is a very strange kind of merge and when is this ever be useful

Bingo. You will rarely every see a many to many merge and the few I've come across are actually mistakes where the data ended up being many to many, after time when it originally was not and someone missing that. I would take a look at your use cases and see if you have any many to many joins in data step and I suspect you do not and thus this is a moot point and your SQL joins will mostly be fine.

In my experience the issues with converting data step to SQL usually originate around first/last/by group processing and retain/lag logic that uses row order.

HeatherNewton · Posted 11-04-2022 03:07 AM

I have now looked into my datasets and realised the below points:

Normally the merge are done on of two datasets with the same 30 variables by 2 to 4 variables on a daily basis. Wdata.pil_master seems accumulative and _pil_master which is a delta file of each day.

Data wdata.pil_master;
    merge wdata.pil_master _pil_master;
    by org_code account_no plan_num ref_no;
run;

so am i right to say the followings:

if the entries in _pil_master is not new, it will amend those in wdata.pil_master

if the entries in _pil_master is new, it will be added to wdata.pil_master

is my understanding of merge in this case, correct?

Kurt_Bremser · Posted 11-04-2022 03:17 AM

What does the log say?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

HeatherNewton · Posted 11-04-2022 03:52 AM

I have no log as I dont have real data to test yet

SASKiwi · Posted 11-07-2022 09:00 PM

You can answer your own questions by creating some test data and running your MERGE code on that.

mkeintz · Posted 10-26-2022 03:32 PM

MATCH MERGE crucially depends on observation order, which SQL (to my knowledge) wholly ignores.

Now, I think it is possible to replicate the match merge using the SQL code below. But it must be preceded by a pair of data step, to put sequence numbers for every obs in BY group

/* Make datasets A and B such that:
    ID=1 is one      to one
    ID=2 is one      to many (2)
    ID=3 is many (2) to one
    ID=4 is none     to many (2)
    ID=5 is many (2) to none
    ID=6 is many (3) to many (3) with identical freqs (N(a)=N(b)
    ID=7 is many (3) to many (2) with N(a)>N(b)
    ID=8 is many (2) to many (3) with N(a)<N(b)  */
data a;
  input id atext $7.  @@;
datalines;
1 Line1.1
2 Line2.1             
3 Line3.1  3 Line3.2

5 Line5.1  5 Line5.2  
6 Line6.1  6 Line6.2  6 Line6.3
7 Line7.1  7 Line7.2  7 Line7.3
8 Line8.1  8 Line8.2
run; 

data b;
  input id btext $7.  @@;
datalines;
1 Line1.1  
2 Line2.1  2 Line2.2  
3 Line3.1
4 Line4.1  4 Line4.2

6 Line6.1  6 Line6.2  6 Line6.3
7 Line7.1  7 Line7.2
8 Line8.1  8 Line8.2  8 Line8.3
run;


data viewa /view=viewa;
  set a;
  by id;
  if first.id then _seq1=1;
  else _seq1+1;
run;

data viewb/view=viewb;
  set b;
  by id;
  if first.id then _seq2=1;
  else _seq2+1;
run;

proc sql;
  create table want_sql
  as select coalesce(a.id,b.id) as id, atext,btext, _seq1, _seq2
  from  viewa as a full join viewb as b
  on  (a.id=b.id)
  group by calculated id
  having a._seq1=b._seq2  
      or a._seq1>max(b._seq2) and _seq2=max(b._seq2)
      or b._seq2>max(a._seq1) and _seq1=max(a._seq1)
  ;
quit;

The strategy here is to do exact matches on the BY variable (ID) and the calculated _SEQ variable. Then, for whichever data set has a higher frequency for the BY-group-in-hand, match it to the maximum _SEQ is the "smaller" dataset.

I think that this is a way to show SQL programmers what a simple match-merge does.

You can test the results of the above via:

proc sort data=want_sql;
  by id _seq1 _seq2;
run;

data want_merge;
  merge a b;
  by id;
run;

proc compare base=want_merge compare=want_sql (drop=_seq1 _seq2) notes noprint;
run;

The PROC COMPARE reports:

NOTE: No unequal values were found. All values compared are exactly equal.
NOTE: The data sets WORK.WANT_MERGE and WORK.WANT_SQL are equal in all respects.
NOTE: There were 18 observations read from the data set WORK.WANT_MERGE.
NOTE: There were 18 observations read from the data set WORK.WANT_SQL.

And while adding a third data set C to the match-merge approach is trivial

data want_merge;
  merge a b c;
  by id;
run;

I do not want to think about replicating it in a single SELECT expression

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Programming

Replicate match merge with Oracle SQL

Re: Replicate match metge with oracle sql

Re: Replicate match metge with oracle sql

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Re: Replicate match merge with Oracle SQL

Follow Us

What is...

SAS Programming

Special offer for SAS Communities members

SAS Training: Just a Click Away

Follow Us

What is...