Re: Many-TO-ONE merge statment

erickdt · Posted 05-25-2020 02:57 PM

hi everyone, i've try to do merge in sas, however i get a strange error:

i have multiples lines in data A and only one in data B

I've used the follow solution:

DATA BASEDEST1 (DROP=FL_VIEW RENAME=(FL_VIEW_1 = FL_VIEW));
MERGE teste1 (WHERE=(FL_VIEW EQ .) IN= A)
TESTE2 (IN=B);
BY CPFCNPJ COD_PRODUTO; IF A;
IF A AND B THEN FL_VIEW_1 = 1; ELSE FL_VIEW_1 = .;
RUN;

I got the "match" however only the first line keep the value, the following lines keep still blank.

Is important to say, I've all coulumns from B in data A.... when i drop it before merge i got the right result, is it possible to get the right values without drop columns?

Tom · Posted 05-25-2020 03:09 PM

Please post example data as I am not sure what you mean by blank.

Is it possible you are looking at the value of a non key variable that exists in both datasets? If so when it reads the next observations from the "many" dataset it will overwrite the value it read from the "one" dataset. If you want the value from the unique dataset then drop the duplicate variable (or rename it) from the non-unique dataset.

Try this program:

proc sort data=sashelp.class out=many;
  by sex name ;
run;

data one; 
  set many;
  by sex;
  if first.sex;
run;

data test1;
   merge many(in=in_many) one(in=in_one);
   by sex;
run;

proc print;
run;

data test2;
  merge many(in=in_many) one(in=in_one keep=sex name rename=(name=first_name));
  by sex;
run;

proc print;
run;

erickdt · Posted 05-25-2020 03:28 PM

Sorry tom, follow my data:

obs: i don't know how to create "blank field" consider '' as blank, lol

data a;
input CPFCNPJ COD_PRODUTO FL_VIEW FL_CANAL:$9.;
datalines;
2020 1326 . ''
2020 1326 . ''
;
data b;
input CPFCNPJ COD_PRODUTO FL_CANAL$;
datalines;
2020 1326 EDT
;

DATA BASEDEST1 (DROP=FL_VIEW RENAME=(FL_VIEW_1 = FL_VIEW));
MERGE A (WHERE=(FL_VIEW EQ .) IN= A)
B (IN=B);
BY CPFCNPJ COD_PRODUTO; IF A;
IF A AND B THEN FL_VIEW_1 = 1; ELSE FL_VIEW_1 = .;
RUN;

@Tom is important to me to not drop variables, because i've A LOT of non key variables (FL_CANAL is just an example) that i need to fill in data A

Tom · Posted 05-25-2020 05:01 PM

That does not make sense. If you want the value from A to always win then whatever is in B doesn't matter. So just drop the variables from B and keep the variables in A.

erickdt · Posted 05-25-2020 07:35 PM

as i said... I've a lot of variables non key in my data A... because I've another merge before this step using another "key" to get the same fields...

Only first line got the values don't make sense....

thanks in advance...

GGO · Posted 05-25-2020 09:07 PM

Erickdt - This is hard for me to follow, but I think that Tom is right.

I think you need a conditional merge - meaning that you want to preserve non-missing values in A, despite merging on same-name vars from B. In case the var in A is missing, you want to conditionally fill in with data from B.

If that is what you need, I don't think you can merge.

I think you need to rename all same-name vars (at least in B).

Then in a data step you can conditionally fill in missing values in A.

For example, pseudo-code (assuming you rename vars in both A and B):

if missing(var1_from_a) then var1_from_a = var1_from_b;

Array processing can reduce this to little more than that one line wrapped inside a loop from 1 to dim(array_of_common_vars). If you have a mix of CHAR and NUM common vars, I think you'll need two arrays - one for CHAR vars; one for NUM vars.

Bottom line: I don't think you want a simple many-to-one merge. I think you want to merge on NEW vars, and then conditionally populate primary data (A) from a secondary source (B).

Perhaps others have better insights.

erickdt · Posted 05-25-2020 09:25 PM

thanks @GGO and @Tom...
in my question I've also put the solution (drop columns).
as i said I haven't idea why only first line retain the value of variables

I thought merge was more effective and efficient than use a proc sql with left join.

I would my result as:

2020 1326 1 EDT
2020 1326 1 EDT

and I got:
2020 1326 1 EDT
2020 1326 1

Thanks for answers... 🙂

Tom · Posted 05-25-2020 10:43 PM

You don't appear to want a merge. That would be used to combine variables that exist in different datasets.

You appear to be doing a lookup. If the "one" dataset is small enough you can load it into a hash object.

data many;
  input year id value :$9. ;
datalines;
2020 1326 . 
2020 1326 . 
;
data one;
  input year id value :$9. ;
datalines;
2020 1326 EDT
;
data want;
  if _n_=1 then do;
    declare hash h(dataset:'one');
    rc=h.definekey('year','id');
    rc=h.definedata(all:'yes');
    rc=h.definedone();
    drop rc;
  end;
  set many ;
  if h.find() then delete;
run;

Tom · Posted 05-25-2020 10:52 PM

Consider how a data step works. When you read an observation from a dateset it copies the data from that observation into the current observation that the step is working with. So if you merge MANY and ONE on YEAR and ID like this:

merge many one ;
by year id;

then for the first observation in the set of records with the current values of YEAR and ID it first reads the first observation from MANY. Then it reads the observation from ONE and the values of any like named variables are overwritten by what is found in ONE. But on the second observation in this group there is a new observation to read from MANY so its values will overwrite what was already stored in the step (the values from ONE) but since there are no more observations from ONE to read for this group nothing is read from ONE this time. So on the second observation you get the value that was in MANY and not the value that was in ONE. If you had dropped the variables from the MANY dataset then the will not overwrite the remembered value from the single observation read from the ONE dataset.

data want;
  merge one many(drop=value);
  by year id;
run;

Consider an SQL join. Perhaps like this.

create table want as
select a.*,b.*
from one a
inner join many b
on a.year=b.year and a.id=b.id
;

Now SQL handles like named variables differently. If you remove the CREATE TABLE part of the query then it just takes both versions and prints them. But since SAS cannot have two variables in the new WANT dataset with the same name it has to pick one. What is does is the first to the post. So the first copy that is named in the select statement is kept and any other variables you try to select with the same name are just discarded.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away