I have data that looks like the following:
data _final_;
input dosenum $ actarmcd $ test $ _name_ $ maxsev cnt pct ci $10.;
cards;
00 ANY ANY ANY 0 32 38 27.7,49.3
00 ANY ANY ANY 1 13 15 15.2,34.3
00 ANY ANY ANY 2 45 54 42.4,64.5
00 ANY ANY ANY 3 19 23 14.2,33.0
00 ANY CHE ANY 0 50 60 48.3,70.1
00 ANY CHE ANY 1 46 55 43.5,65.7
00 ANY CHE ANY 2 4 5 1.3, 11.7
00 ANY CHE ANY 3 23 27 18.2,38.2
00 ANY CHE AL 0 6 7 2.7,14.9
00 ANY CHE AL 2 4 5 1.3, 11.7
00 ANY CHE CR 0 19 23 14.2,33
00 ANY CHE CR 1 46 55 43.5,65.7
00 ANY CHE CR 3 23 27 18.2,38.2
00 ANY HEM PL 0 53 63 51.9,73.4
;
run;
DOSENUM ranges from 00-03. There are 5 different treatment groups (ACTARMCD). TEST is defining the type of test (CHE, HEM, URN) and _NAME_ is the specific type of measurement. MAXSEV is the severity grading. As of right now, the _FINAL_ dataset counts no events (MAXSEV = 0) and existing events (MAXSEV > 0). However, if an event did not occur, it is not included (e.g. If only two moderately severe increases in alanine (TEST = CHE, _NAME_ = AL) occurred for first dose (DOSENUM = 01) in Group 1, then only MAXSEV = 0 and 1 are recorded). I want MAXSEV 0-3 available for all existing combinations of dose, treatment, test type, and measurement. That is, zero would just be recorded if there was nothing rather than excluded entirely.
My attempt:
1936 proc sql;
1937 create table maxsev_ as
1938 select distinct dosenum, actarmcd, test, _name_, 0 as maxsev from _final_
1939 union corr
1940 select distinct dosenum, actarmcd, test, _name_, 1 as maxsev from _final_
1941 union corr
1942 select distinct dosenum, actarmcd, test, _name_, 2 as maxsev from _final_
1943 union corr
1944 select distinct dosenum, actarmcd, test, _name_, 3 as maxsev from _final_;
NOTE: Compressing data set WORK.MAXSEV_ increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: Table WORK.MAXSEV_ created, with 1144 rows and 5 columns.
1945 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.02 seconds
cpu time 0.01 seconds
1945! /*1144 obs*/
1946
1947 proc sql;
1948 create table final_ as
1949 select a.*, b.cnt, b.pct, b.ci
1950 from maxsev_ as a left join _final_ as b
1951 on a.actarmcd = b.actarmcd and a.dosenum = b.dosenum and a._name_ = b._name_ and a.maxsev = b.maxsev;
NOTE: Compressing data set WORK.FINAL_ increased size by 50.00 percent.
Compressed is 3 pages; un-compressed would require 2 pages.
NOTE: Table WORK.FINAL_ created, with 1486 rows and 8 columns.
1952 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.05 seconds
cpu time 0.03 seconds
1952! /*1486 obs*/
The number of observations increased to more observations than in _FINAL_ (475 obs) and MAXSEV_ (1144 obs). Why might the number of observations not be equivalent to MAXSEV_?
When you do a left (or right) join on a variable or combinations of variables if the right (or left) set, the second one used has multiple values for the join variables then each of the records in the second set will get added.
If you have multiples of the join variables in both sets then each one gets matched to the other set so the increase can be quite noticeable in those cases.
I shouldn't have repeat combinations (or at least that was the goal), so I'll have to go back and double check my data.
Thank you!
Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.
Register today!Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: