A recent PharmaSUG China paper (this one) inspired me, but I'm wondering if I can make it more efficient. The focus here is on 'MERGING SUPP DOMAIN', pages 1-4 of the paper.
As some very brief background, in clinical data standards a SUPP (supplemental qualifier) domain is a dataset that contains additional non-standard variables in long format, meaning it is to be transposed and merged back into its core dataset before it can be used. The merge key is IDVAR, which identifies the name of the key variable, and IDVARVAL, its value. The transpose key is QNAM (variable name) and QVAL (value). Keep in mind IDVAR and QNAM are meta-variables - their values refer to actual variables.
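As a toy illustration (hypothetical domain and qualifier names), a SUPP dataset and the effect of the transpose/merge might look like this:

```sas
/* Hypothetical toy SUPPAE: extra qualifiers for the AE domain.
   IDVAR names the merge key in AE; QNAM names the variable to create. */
data suppae;
  length idvar $8 idvarval $8 qnam $8 qval $20;
  input idvar $ idvarval $ qnam $ qval $;
  datalines;
AESEQ 1 AETRTEM Y
AESEQ 1 AESOSP  N
AESEQ 2 AETRTEM N
;
run;
/* After transposing on QNAM/QVAL and merging back to AE by AESEQ,
   AE gains two new variables: AETRTEM and AESOSP. */
```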
The paper very nicely shows how you can replace separate sorts, transposes and merges by a single hash table-based data step, but it leaves in some assumptions: that you know/specify ahead of time what the IDVAR value is going to be (glossing over the fact that there can be multiple IDVAR values), and that you know/specify all distinct values for QNAM (glossing over the fact that these could occur many-to-one and may require extension/deduplication).
I've extended this approach so multiple IDVAR values are handled in separate hash tables, and QNAMs each take a spot in some pre-allocated array. The one remaining problem is that you only know the total number and names of the new variables after you've enumerated all QNAMs (and checked for many-to-one relations), so this array uses placeholder names (say, TEMP1 - TEMPn) and dimensions (bigger by a margin than the likely total count of new variables). All of this still happens in a single data step and by reading each dataset once, which is nice because they can grow pretty big. The one remaining step after all of this is to rename the variables (TEMP1 -> QNAM1, TEMP2 -> QNAM2... according to the QNAM values) and drop the unused TEMPx placeholders.
Renaming the variables can be done through PROC DATASETS which doesn't read the data again, but I'm wondering if there's a way to drop the unused placeholders, or handle the variable creation/naming dynamically in the first place (not having to over-allocate an array), that doesn't require a second pass through the data?
Since that paper apparently has a very targeted audience, it does not provide any example data, so I am unable to follow exactly what may be needed.
If your concern is the several places where they have a series of IF/THEN statements assigning values based on literal values, it may be possible to create a custom format/informat that does the assignment. Again, how to do that depends on the actual contents of the data, and I do not see any example. But if you can extract the values from a data set, plus data set information, that may help.
If you use arrays for the variables like XX1, XX2, XX3 in the example, and you know or can get the list of distinct QNAM values from the data (possibly into a macro variable), you can use the WHICHC function to find the value in an ordered list and assign the value to an array element based on that order.
A brief example:
data example;
  input qnam $;
  array x(3) $;
  x[whichc(qnam,"xx1","xx2","xx3")] = qnam;
datalines;
xx1
xx3
xx2
xx1
;
run;
The list "xx1","xx2","xx3" could be built as a macro variable with code such as
proc sql noprint;
  select distinct quote(strip(qnam))
    into :qnamlist separated by ','
    from example;
quit;
%let qnamcount=&sqlobs;
%put value list: &qnamlist. count: &qnamcount.;
SQLOBS is an automatic macro variable that holds the number of rows returned by the most recent PROC SQL SELECT. So using this, you could replace the list in the WHICHC call with &qnamlist. and use &qnamcount. in the ARRAY x definition instead of 3.
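Putting those macro variables to work, the array and lookup might then look like this (untested sketch; the guard against WHICHC returning 0 avoids an out-of-range subscript when a QNAM is not in the list):

```sas
/* Sketch: drive the array size and the WHICHC search list from the
   macro variables built by the PROC SQL step above. */
data example2;
  set example;
  array x(&qnamcount.) $ ;
  _pos = whichc(qnam, &qnamlist.);
  if _pos > 0 then x[_pos] = qnam;
  drop _pos;
run;
```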
There is also a parallel function, WHICHN, to search numeric lists, but conversion of numeric values to text for the search (without quotes) may be problematic due to rounding errors and the choice of format.
Do the "placeholder" variables need to be actual variables? Or can they be a _TEMPORARY_ array?
It is always disappointing when you see a clear misunderstanding of SAS code on the first page of a paper.
Such as the use of BEST. as an informat:
xxseq=input(idvarval,best.);
Better to use an actual informat name with the maximum width instead of the default width:
xxseq=input(idvarval,32.);
Let me see if I have the concept right. You want to get the set of distinct values of QNAM so you can generate the series of IF statements (why not a SELECT statement?) in this block.
do while (h.do_over()=0);
if qnam="xx1" then xx1=qval;
if qnam="xx2" then xx2=qval;
if qnam="xx3" then xx3=qval;
end;
Is there not already a source for that list of QNAM values you can use to generate that code block? Is there not something like a DEFINE.DOC file defining the study that you can reference, which would have that data?
If not then generate it in advance from one (both? all?) of the datasets you are reading. You can use a SAS procedure to do it efficiently. Perhaps one like PROC MEANS that can use multiple threads.
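One way to generate that block up front (at the cost of an extra pass over the SUPP data) is to build the assignments into a macro variable; a sketch, with SUPP as a placeholder dataset name and assuming every QNAM value is a valid variable name:

```sas
/* Sketch: turn each distinct QNAM into an IF/THEN assignment. */
proc sql noprint;
  select distinct cats('if qnam="', qnam, '" then ', qnam, '=qval;')
    into :assigns separated by ' '
    from supp;
quit;

/* Then, inside the merge data step: */
do while (h.do_over()=0);
  &assigns.
end;
```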
Thank you for the input! It is indeed a pretty niche problem that's the result of common data standards in clinical trials, but perhaps I can rephrase it more generally - the crux of my question is that I'm trying to replicate a PROC TRANSPOSE through a hash table in a single data step (the merge happens at the same time but that's inconsequential). As a very simple example, this is almost what I'm trying to do:
data HAVE;
length IDVARVAL 8 QNAM QVAL $10;
infile datalines delimiter='|';
input IDVARVAL QNAM $ QVAL $;
cards;
1 | FOO | ONE
1 | BAR | TWO
2 | FOO | THREE
2 | BAR | FOUR
2 | FOO | FIVE
3 | BAZ | SIX
;
run;
/* Will not work given this input, QNAM=FOO appears twice for IDVARVAL=2. */
proc transpose data=HAVE out=WANT;
by IDVARVAL;
var QVAL;
id QNAM;
run;
A problem with the above code, which is already handled in my solution, is that duplicate QNAM (ID) values within IDVARVAL (BY) should be deduplicated with a suffix or so, i.e. this is the desired output data (variable order doesn't matter too much):
IDVARVAL | FOO   | BAR  | BAZ | FOO_DUP1
1        | ONE   | TWO  |     |
2        | THREE | FOUR |     | FIVE
3        |       |      | SIX |
In the hash table this is handled through a pre-allocated array which is not yet aware of what QNAMs it will receive, enumerating the distinct QNAMs, and then during the merge checking whether that QNAM has already occurred for the current IDVARVAL; if so this needs to be made into a new QNAM. It's a bit more complicated than I'm showing here because there might be different IDVARs that have different IDVARVALs which end up being the same record in the end, but let's ignore that for now. This is the output that I do have:
IDVARVAL | TEMP1 | TEMP2 | TEMP3 | TEMP4 | TEMP5 | ... | TEMPn
1        | ONE   | TWO   |       |       |       |     |
2        | THREE | FOUR  |       | FIVE  |       |     |
3        |       |       | SIX   |       |       |     |
In addition, I have the actual names of the transposed QNAMs (i.e. TEMP1 = FOO, TEMP2 = BAR, TEMP3 = BAZ, TEMP4 = FOO_DUP1) and the maximum number of deduplicated QNAMs (4, so TEMP5 - TEMPn have to be dropped) in macro variables created during the same data step via call symput (so they will only be available after execution). All of this happens in a single data step with a single pass through each dataset, and I would like to keep it that way if possible; some of these datasets are >>tens of millions of records.
I can rename the required variables, currently only known after the above data step, through PROC DATASETS very cheaply, but I don't know if there is a way to drop the TEMP5 - TEMPn overflow without reading the entire dataset again. The latter is what I'm doing now, just a second data step with a drop statement, which (maybe?) leaves the opportunity to cut runtime in half. Alternatively, if there is another way of handling the transpose and the deduplication (i.e. dynamically allocating and naming variables within one data step) in the first place that might also work.
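For reference, the descriptor-only rename looks roughly like this (sketch; &qnamcount. and the &&qnamN macro variables are assumed to have been filled by CALL SYMPUT during the merge step):

```sas
/* Sketch: rename TEMPx to the captured QNAM values without reading
   the data; PROC DATASETS only rewrites the descriptor. */
%macro rename_temps;
  proc datasets library=work nolist;
    modify want;
    rename %do i = 1 %to &qnamcount.; temp&i. = &&qnam&i. %end; ;
  quit;
%mend rename_temps;
%rename_temps
```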
Reading in the distinct QNAMs via PROC SQL or some other procedure would require a second pass through the (very big) SUPP dataset, and neither this nor the DEFINE specification resolves the deduplication issue without additional processing. It also can't be a _TEMPORARY_ array, because the goal is exactly to create new variables on the output. All of this already happens in my current single data step; I'm trying to figure out if I can have my cake and eat it too 😋
Transpose to a long form, so that the variable names become data, add columns, and then re-transpose:
data have;
input id var1 var2;
datalines;
1 2 3
;
data add;
input varname $ value;
datalines;
var3 4
;
proc transpose data=have out=long;
by id;
var var:;
run;
data long2;
set long;
by id;
output;
if last.id
then do;
do i = 1 to nobs;
set add (rename=(varname=_name_ value=col1)) point=i nobs=nobs;
output;
end;
end;
run;
proc transpose data=long2 out=want (drop=_name_);
by id;
var col1;
id _name_;
run;
You may have to split the process for character and numeric variables.
If you also need to dynamically drop variables, you can read the "drop" table into a hash (in the long2 data step) and use it as a lookup table for deleting observations.
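That lookup could be sketched like this (dataset and variable names are assumptions; the drop table DROPVARS is expected to hold one character variable, VARNAME):

```sas
/* Sketch: remove long-format rows whose _NAME_ appears in a
   "drop" table, using a hash object as the lookup. */
data long3;
  length varname $32;
  if _n_ = 1 then do;
    declare hash d(dataset: "dropvars");
    d.defineKey("varname");
    d.defineDone();
  end;
  set long2;
  varname = _name_;
  if d.check() ne 0;   /* keep only names NOT in the drop table */
  drop varname;
run;
```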