Re: Increased number of observations after left join. Why?

mariko5797 · Posted 11-05-2021 11:42 AM

I have data that looks like the following:

data _final_;
 input dosenum $ actarmcd $ test $ _name_ $ maxsev cnt pct ci $10.;
 cards;
 	00	ANY	ANY	ANY	0	32	38	27.7,49.3
	00	ANY	ANY	ANY	1	13	15	15.2,34.3
	00	ANY	ANY	ANY	2	45	54	42.4,64.5
	00	ANY	ANY	ANY	3	19	23	14.2,33.0
	00	ANY	CHE	ANY	0	50	60	48.3,70.1
	00	ANY	CHE	ANY	1	46	55	43.5,65.7
	00	ANY CHE ANY 2	4	5	1.3, 11.7
	00	ANY CHE ANY 3	23	27	18.2,38.2
	00	ANY	CHE	AL	0	6	7	2.7,14.9
	00	ANY	CHE	AL	2	4	5	1.3, 11.7
	00	ANY	CHE	CR	0	19	23	14.2,33
	00	ANY	CHE	CR	1	46	55	43.5,65.7
	00	ANY	CHE	CR	3	23	27	18.2,38.2
	00	ANY	HEM	PL	0	53	63	51.9,73.4
 ;
run;

DOSENUM ranges from 00-03. There are 5 different treatment groups (ACTARMCD). TEST is defining the type of test (CHE, HEM, URN) and _NAME_ is the specific type of measurement. MAXSEV is the severity grading. As of right now, the _FINAL_ dataset counts no events (MAXSEV = 0) and existing events (MAXSEV > 0). However, if an event did not occur, it is not included (e.g. If only two moderately severe increases in alanine (TEST = CHE, _NAME_ = AL) occurred for first dose (DOSENUM = 01) in Group 1, then only MAXSEV = 0 and 1 are recorded). I want MAXSEV 0-3 available for all existing combinations of dose, treatment, test type, and measurement. That is, zero would just be recorded if there was nothing rather than excluded entirely.

My attempt:

1936  proc sql;
1937   create table maxsev_ as
1938      select distinct dosenum, actarmcd, test, _name_, 0 as maxsev from _final_
1939      union corr
1940      select distinct dosenum, actarmcd, test, _name_, 1 as maxsev from _final_
1941      union corr
1942      select distinct dosenum, actarmcd, test, _name_, 2 as maxsev from _final_
1943      union corr
1944      select distinct dosenum, actarmcd, test, _name_, 3 as maxsev from _final_;
NOTE: Compressing data set WORK.MAXSEV_ increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.MAXSEV_ created, with 1144 rows and 5 columns.

1945  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.02 seconds
      cpu time            0.01 seconds


1945!       /*1144 obs*/
1946
1947  proc sql;
1948   create table final_ as
1949      select a.*, b.cnt, b.pct, b.ci
1950          from maxsev_ as a left join _final_ as b
1951          on a.actarmcd = b.actarmcd and a.dosenum = b.dosenum and a._name_ = b._name_ and a.maxsev = b.maxsev;
NOTE: Compressing data set WORK.FINAL_ increased size by 50.00 percent.
      Compressed is 3 pages; un-compressed would require 2 pages.
NOTE: Table WORK.FINAL_ created, with 1486 rows and 8 columns.

1952  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.05 seconds
      cpu time            0.03 seconds


1952!       /*1486 obs*/

The number of observations increased to more observations than in _FINAL_ (475 obs) and MAXSEV_ (1144 obs). Why might the number of observations not be equivalent to MAXSEV_?

ballardw · Posted 11-05-2021 12:16 PM

When you do a left (or right) join on a variable or combinations of variables if the right (or left) set, the second one used has multiple values for the join variables then each of the records in the second set will get added.

If you have multiples of the join variables in both sets then each one gets matched to the other set so the increase can be quite noticeable in those cases.

mariko5797 · Posted 11-05-2021 12:35 PM

I shouldn't have repeat combinations (or at least that was the goal), so I'll have to go back and double check my data.

Thank you!

Increased number of observations after left join. Why?

Re: Increased number of observations after left join. Why?

Re: Increased number of observations after left join. Why?

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away