topic Re: IN dataset option in MERGE two datasets. in SAS Programming

IN dataset option in MERGE two datasets.

saslovethemost — Fri, 15 Sep 2017 12:46:49 GMT

I am getting confused with IN= dataset option in MERGE statement with two datasets.

data _NULL_;
    merge txnsfile(IN=txns) dupsfile(IN=dups);
    by key;
    flle extnrpt;

Follwoing is what I want to achieve.

1. If the key is there in both the files then I want the record from txns file.
2. If the key is not thre in txns file but is there in dups file then I want the record from dups file.
3. If the key is there in txns file but not there in dups file then I want it from the txns file.

Thansk for your time.

Neal.

Re: IN dataset option in MERGE two datasets.

Quentin — Fri, 15 Sep 2017 13:09:03 GMT

Merge is typically used to merge two datasets "horizontally", i.e. adding variables.

From your description, it sounds like you may have the same variables in both datasets, is that right? And your goal is to find the KEY values that are in dupsfile but not in txnsfile, and add those records to txnsfile ("vertically"), is that right?

It would be easier to help you if you made added code to make a small sample (~5 records) of txnsfile and dupsfile, and show the output you want.

Re: IN dataset option in MERGE two datasets.

LinusH — Fri, 15 Sep 2017 13:52:13 GMT

If I recall right, the values from the first data set in the merge statement overrides any other values.

So by ordering the merge statement will probably be sufficient. I don't think that the IN= applies to the described logic.

http://support.sas.com/documentation/cdl/en/lrcon/69852/HTML/default/viewer.htm#n1tgk0uanvisvon1r26lc036k0w7.htm

Re: IN dataset option in MERGE two datasets.

Quentin — Fri, 15 Sep 2017 14:17:36 GMT

@LinusH They way SAS handles variable collisions in a merge are tricky. I think it's better to think of them as collisions to be avoided, rather than a tool to be used. In a one-to-one merge with collisions, the variables from the value from the second dataset will overwrite the value for the first dataset. But in a many(left)-to-one(right) merge with collisions, only the first record for a BY group from the many dataset will have its value overwritten, because when the second record is read it will overwrite the value in the PDV from the right dataset. That's an ugly sentence. Also you can have issues caused by colliding variables having different attributes.

I tend to turn on msglevel=i and treat collision notes as errors. I wish there was an option to make this an error, like MERGENOBY.

Here's an example of the many-to-one collision problem which is a surprise to many folks starting to work with the MERGE statement:

data many;
  input id x;
  cards;
1 0
2 0
2 0
3 0
;
run;

data one;
  input id x;
  cards;
2 1
3 1
;
run;

*use a merge to attempt to update the values of x for id=2 and id=3;
*does not produce the desired result;

options msglevel=i;
data want;
  merge many one;
  by id;

  put (id x)(=);
run;

Returns:

118  options msglevel=i;
119  data want;
120    merge many one;
121    by id;
122
123    put (id x)(=);
124  run;

INFO: The variable x on data set WORK.MANY will be overwritten by data set WORK.ONE.
id=1 x=0
id=2 x=1
id=2 x=0
id=3 x=1
NOTE: There were 4 observations read from the data set WORK.MANY.
NOTE: There were 2 observations read from the data set WORK.ONE.
NOTE: The data set WORK.WANT has 4 observations and 2 variables.

Note that the INFO: line in the log might actually lead you to think it's workig like you want "Yayy, the value was over-written!" but on the second record for id=2 the value of x was not overwritten.

Re: IN dataset option in MERGE two datasets.

saslovethemost — Fri, 15 Sep 2017 15:01:00 GMT

data many;
  input id x;
cards; 
1 0 
2 0 
4 0;
run;

data one;
  input id x;
cards; 
2 1 
3 1 
4 1;
run;

I want the ouptut like below.

id x
1  0
2  0
3  1
4  0;

Re: IN dataset option in MERGE two datasets.

Quentin — Fri, 15 Sep 2017 15:13:47 GMT

I'd look into the UPDATE statement rather than merge, e.g.:

data trans;
  input id x;
cards; 
1 0
2 0
4 0
;
run;

data master;
  input id x;
cards; 
2 1
3 1
4 1
;
run;

data want;
  update master trans;
  by id;

  put (id x)(=);
run;

Returns:

165  data want;
166    update master trans;
167    by id;
168
169    put (id x)(=);
170  run;

id=1 x=0
id=2 x=0
id=3 x=1
id=4 x=0
NOTE: There were 3 observations read from the data set WORK.MASTER.
NOTE: There were 3 observations read from the data set WORK.TRANS.
NOTE: The data set WORK.WANT has 4 observations and 2 variables.

Re: IN dataset option in MERGE two datasets.

Haikuo — Fri, 15 Sep 2017 15:36:38 GMT

I second @Quentin, 'Update' is the way to go. As an unorthodox alternative, you could also 'set' interlace to have the job done:

data trans;
  input id x;
cards; 
1 0
2 0
4 0
;
run;

data master;
  input id x;
cards; 
2 1
3 1
4 1
;
run;


DATA WANT;
SET TRANS MASTER;
BY ID;
IF FIRST.ID;
RUN;

Re: IN dataset option in MERGE two datasets.

saslovethemost — Fri, 15 Sep 2017 15:37:21 GMT

Looks good, can we achive the same using MERGE statements, thanks in advance.

Re: IN dataset option in MERGE two datasets.

Tom — Fri, 15 Sep 2017 15:38:19 GMT

That is not a many to one merge as each ID only occurs once per source dataset. Also it does not look like you want to do a merge at all.

This code wil interleave the rows by ID and select the first row for each ID.

That will produce your desired output.

data first;
  input id x;
cards; 
1 0 
2 0 
4 0
;

data second;
  input id x;
cards; 
2 1 
3 1 
4 1
;

data want ;
  set first second ;
  by id ;
  if first.id;
run;

Re: IN dataset option in MERGE two datasets.

Haikuo — Fri, 15 Sep 2017 15:44:54 GMT

You could, but be aware that you will need to deal each variables one by one if you have many variables, which is ugly coding. So why not 'update'? is this a homework question?

data trans;
  input id x;
cards; 
1 0
2 0
4 0
;
run;

data master;
  input id x;
cards; 
2 1
3 1
4 1
;
run;


DATA WANT(drop=_x);
merge TRANS (in=a) MASTER(in=b rename=x=_x);
BY ID;
IF not a and b then x=_x;
RUN;