Solved: Re: Update unmatched missing values from left join

SK_11 · Posted 02-23-2021 06:06 PM

I have the following

data have1;
input Grp cum ;
datalines;
1 2
2 4
3 6
4 8
5 10
6 12
;
run;

data have2;
input id ;
datalines;
1
2
3
4
5
6
7
8
9
10
11
12
;
run;

proc sql;
Create table want as
Select a.Id,
b.Grp
from have2 as a
left join have1 as b
on a.id=b.Cum
;

The above gives the output as

id	Grp
1	.
2	1
3	.
4	2
5	.
6	3
7	.
8	4
9	.
10	5
11	.
12	6

I would like to get this:

id	Grp
1	1
2	1
3	2
4	2
5	3
6	3
7	4
8	4
9	5
10	5
11	6
12	6

Thanks for your help in advance.

Tom · Posted 02-24-2021 01:07 AM

If you want to "join" HAVE1 and HAVE2 then first convert HAVE1 to something with a range of values of ID.

data groups;
  set have1;
  by cum;
  min=sum(lag(cum),1);
  max=cum;
  keep grp min max;
run;

Now you can join HAVE2 with GROUPS and assign the GRP value to each observation in HAVE2.

proc sql ;
create table want as
  select * from have2 a inner join groups b
  on a.id between b.min and b.max
;
quit;

View solution in original post

qoit · Posted 02-23-2021 07:49 PM

SQL is not really the solution for this but here you go:

data have1;
	input Grp cum;
	datalines;
1 2
2 4
3 6
4 8
5 10
6 12
;
run;

data have2;
	input id;
	datalines;
1
2
3
4
5
6
7
8
9
10
11
12
;
run;

proc sql;
	Create table want (keep=id grp) as
		Select a.Id,
			b.Grp as grp_,
		case 
			when b.grp is missing then monotonic() - 1 
		end 
	as col,
		case 
			when b.grp is missing then sum(1,calculated col) 
			else b.grp 
		end 
	as grp
		from have2 as a
			left join have1 as b
				on a.id=b.Cum;
quit;

SK_11 · Posted 02-23-2021 10:32 PM

Thanks for your quick reply, I have updated the dataset to clear any confusion.

Tom · Posted 02-23-2021 07:57 PM

What is the reasoning that says ID=1 should be matched to GRP=1? I do not see the connection.

SK_11 · Posted 02-23-2021 08:22 PM

There is no relationship between group and id. This is a mere representation of more than 100K rows.

SK_11 · Posted 02-23-2021 10:30 PM

To remove the confusion here is the updated data set.

data have1;
input Grp cum ;
datalines;
1 3252
2 6752
3 10252
4 13752
5 17252
6 20752
;
run;

data have2;
do i=1 to 20000;
id=i;output;
end;
drop i;
run;

Tom · Posted 02-23-2021 11:24 PM

You must have had some algorithm in your mind when went from your two example input datasets to your desired output dataset.

What was the logic that said ID=1 was part of GRP=1 instead of GRP=2 or 3 or 3457?

SK_11 · Posted 02-23-2021 11:34 PM

I am trying to fill up the missing group values when I left join. The desired output will be all ids less than cum should have the same group value.

Tom · Posted 02-23-2021 11:39 PM

@SK_11 wrote:

I am trying to fill up the missing group values when I left join. The desired output will be all ids less than cum should have the same group value.

There must be more to it than that. If HAVE1 is

data have1;
  input Grp cum;
datalines;
1 2
2 4
3 6
4 8
5 10
6 12
;

The ID=1 will be a member of all 6 GRP values since one is less than 2 and also less than 4 and less then 6 etc.

Why would you want to try to force a solution to this problem using SQL?

Tom · Posted 02-23-2021 11:47 PM

Do you just want to do this?

Given this input:

data have1;
  input Grp cum;
datalines;
1 2
2 4
3 6
4 8
5 10
6 12
;

Run this data step:

data want;
  set have1;
  do id=sum(lag(cum),1) to cum;
     output;
  end;
run;

Result:

Obs    Grp    cum    id

  1     1       2     1
  2     1       2     2
  3     2       4     3
  4     2       4     4
  5     3       6     5
  6     3       6     6
  7     4       8     7
  8     4       8     8
  9     5      10     9
 10     5      10    10
 11     6      12    11
 12     6      12    12

SK_11 · Posted 02-23-2021 11:48 PM

Thanks for your quick reply. Any solutions would be fine. I have given a SQL example to show the missing values when I left join. I am interested to fill up the missing values of the group column where ids are less than the matched ids. I have updated the example data set as
data have1;
input Grp cum ;
datalines;
1 3252
2 6752
3 10252
4 13752
5 17252
6 20752
;
run;

data have2;
do i=1 to 20000;
id=i;output;
end;
drop i;
run;

Tom · Posted 02-24-2021 12:50 AM

What does HAVE2 add to the problem? Are you just trying to set an upper bound on the value of ID?

data have1;
input Grp cum ;
datalines;
1 3252
2 6752
3 10252
4 13752
5 17252
6 20752
;
data want;
  set have1;
  do id=sum(lag(cum),1) to min(cum,20000);
     output;
  end;
run;
proc freq data=want;
  tables grp ;
run;

The FREQ Procedure

                                Cumulative    Cumulative
Grp    Frequency     Percent     Frequency      Percent
--------------------------------------------------------
  1        3252       16.26          3252        16.26
  2        3500       17.50          6752        33.76
  3        3500       17.50         10252        51.26
  4        3500       17.50         13752        68.76
  5        3500       17.50         17252        86.26
  6        2748       13.74         20000       100.00

Tom · Posted 02-24-2021 01:07 AM

If you want to "join" HAVE1 and HAVE2 then first convert HAVE1 to something with a range of values of ID.

data groups;
  set have1;
  by cum;
  min=sum(lag(cum),1);
  max=cum;
  keep grp min max;
run;

Now you can join HAVE2 with GROUPS and assign the GRP value to each observation in HAVE2.

proc sql ;
create table want as
  select * from have2 a inner join groups b
  on a.id between b.min and b.max
;
quit;

SK_11 · Posted 02-24-2021 03:22 AM

Thanks a lot Tom

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away