Solved: How to score elements of an array specified in another dataset

Coreyw184 · Posted 03-10-2025 07:14 PM

I am sorry if this is a straight forward question but I cannot figure out how to solve this.

I have two datasets, one of values where each row is a set of values (and other columns but they are excluded for the sake of example) and one of sequences which I would like to sum for each observation. In the example data below I would like to have 4 new columns in the values data seq1_sum, seq2_sum, seq3_sum, and seq4_sum. These sums would be the sums over each of the sequences respectively. For example the seq4 column would be the sum across each row for score3, score4, and score7.

sequences:

values:

Tom · Posted 03-10-2025 11:22 PM

Use PROC SCORE.

https://documentation.sas.com/doc/en/statug/latest/statug_score_syntax01.htm

You might have to restructure your "scoring" dataset to get it to work.

If you want help provide actual datasets. And the answer you are looking for.

You will probably want to reduce the problem to just a few variables and a few observations of data.

View solution in original post

quickbluefish · Posted 03-10-2025 08:38 PM

Here is some fake data to more or less match what you have in your screenshots - run this and use PROC PRINT to look at these if you like:

data seqs;
length seq $20 sequences $6 s1-s8 3;
array s {*} s1-s8;
do i=1 to dim(s);
	s[i]=(rand('uniform')<0.3);
end;
drop i;
call symputx("nseqs",_N_-1);
infile cards dsd truncover firstobs=1 dlm='|';
input seq sequences;
cards;
1,2,3,4,5,6,7,8|seq1
1,2,3,4|seq2
1,3,6,8|seq3
3,4,7|seq4
;
run;

data values;
length score1-score8 3;
array s {*} score1-score8;
do r=1 to 20;
	do i=1 to dim(s);
		s[i]=(rand('uniform')<0.6);
	end;
	output;
end;
drop r i;
run;

I can't say I really understand your request - especially whether you're trying to sum things vertically or horizontally. My assumption based on what you said is that you want to sum horizontally... :

data want;
set
	seqs (in=A)
	values
	;
array T {&nseqs} $20 _temporary_;
array sc {*} score1-score8;
array sq {*} seq1-seq4;
if A then T[_N_]=seq;
else do;
	call missing(of sq[*]);
	do i=1 to dim(sq);
		do s=1 to countW(T[i],',');
			sq[i]+sc[scan(T[i],s,',')*1];
		end;
	end;
	output;
end;
keep score1-score8 seq1-seq4;
run;

proc print data=want; run;

Ksharp · Posted 03-10-2025 09:19 PM

Next time, you 'd better post your data by using a data step code,not just a picuture. Nobody would like to type it for you . This time I used quickbluefish 's code to generate your sample data.

data seqs;
length seq $20 sequences $6 s1-s8 3;
array s {*} s1-s8;
do i=1 to dim(s);
	s[i]=(rand('uniform')<0.3);
end;
drop i;
call symputx("nseqs",_N_-1);
infile cards dsd truncover firstobs=1 dlm='|';
input seq sequences;
cards;
1,2,3,4,5,6,7,8|seq1
1,2,3,4|seq2
1,3,6,8|seq3
3,4,7|seq4
;
run;

data values;
length score1-score8 3;
array s {*} score1-score8;
do r=1 to 20;
	do i=1 to dim(s);
		s[i]=(rand('uniform')<0.6);
	end;
	output;
end;
drop r i;
run;

filename x temp;
data _null_;
 set seqs;
 file x;
 x=cats(sequences,'_sum=sum(',prxchange('s/(\d+)/score\1/',-1,seq),');');
 put x;
run;
data want;
 set values;
 %include x/source;
run;

mkeintz · Posted 03-10-2025 09:42 PM

What you apparently want is code that looks like this:

data want;
  set values;
  seq1 = sum(score1 ,score2 ,score3 ,score4 ,score5 ,score6 ,score7 ,score8 );
  seq2 = sum(score1 ,score2 ,score3 ,score4 );
  seq3 = sum(score1 ,score3 ,score6 ,score8 );
  seq4 = sum(score3 ,score4 ,score7 );
run;

You can use a DATA _NULL_ step reading dataset SEQ to produce that code in a temporary file, which can then be INCLUDED in a subsequent DATA WANT step, as in:

filename sumcode temp;
data _null_;
  set seq;

  array sdummies {*} s1-s8 ;

  file sumcode;
  put seqnames '= sum(' @;
  do i=1 to dim(sdummies);
    if sdummies{i}=1 then put 'score' i ',' @;
  end;
  put + (-1) ');';
run;

data want;
  set values;
  %include sumcode / source2 ;
run;

The primary tool used here is the trailing @ used in the PUT statements, which holds the output column pointer in place, waiting for the next PUT statement. The last PUT has the pointer control '+ (-1)' which moves the pointer back one character, in order to replace the troublesome ~~unnecessary~~ comma appended to the last variable name with a closing paren and a semi-colon.

This program is untested in the absence of sample data in the form of working DATA steps.

I see other responders have utilized the character variable SEQ, instead of the S1 through S8 variables that I used. If those variables are not available, then change the DATA _NULL_ step to:

filename sumcode temp;
data _null_;
  set seq;
  file sumcode;

  put seqnames '= sum(' @;
  do i=1 to countw(seq);
    addend=cats('score',scan(seq,i)) ;
    if i<countw(seq) then addend=cats(addend,',');
    put addend @;
  end;
  put ');';
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Tom · Posted 03-10-2025 11:22 PM

Use PROC SCORE.

https://documentation.sas.com/doc/en/statug/latest/statug_score_syntax01.htm

You might have to restructure your "scoring" dataset to get it to work.

If you want help provide actual datasets. And the answer you are looking for.

You will probably want to reduce the problem to just a few variables and a few observations of data.

Coreyw184 · Posted 03-11-2025 05:15 PM

PROC SCORE was exactly what I needed, thank you!

Ksharp · Posted 03-10-2025 11:43 PM

I was motivated by Tom's PROC SCORE.

You could use IML code to get the desired output which is most simple and efficient.

data seqs;
length seq $20 sequences $6 s1-s8 3;
array s {*} s1-s8;
do i=1 to dim(s);
	s[i]=(rand('uniform')<0.3);
end;
drop i;
call symputx("nseqs",_N_-1);
infile cards dsd truncover firstobs=1 dlm='|';
input seq sequences;
cards;
1,2,3,4,5,6,7,8|seq1
1,2,3,4|seq2
1,3,6,8|seq3
3,4,7|seq4
;
run;

data values;
length score1-score8 3;
array s {*} score1-score8;
do r=1 to 20;
	do i=1 to dim(s);
		s[i]=(rand('uniform')<0.6);
	end;
	output;
end;
drop r i;
run;




proc iml;
use seqs(keep=s:);
read all var _num_ into s;
close;
use values(keep=score:);
read all var _num_ into score;
close;
want=score*t(s);
create want from want[c=('sum_seq1':'sum_seq'+char(nrow(s)))];
append from want;
close;
quit;

Tom · Posted 03-11-2025 09:26 AM

Make sure your scoring dataset has _NAME_ and _TYPE_ variables.

data seq ;
  _type_='SCORE';
  input description:$20. _name_ $ score1-score8;
cards;
1,2,3,4,5,6,7,8 seq1 1 1 1 1 1 1 1 1
1,2,3,4         seq2 1 1 1 1 0 0 0 0
1,3,6,8         seq3 1 0 1 0 0 1 0 1
3,4,7           seq4 0 0 1 1 0 0 1 0
;

Now let's make up a few observations of data:

data values;
  input id score1-score8;
cards;
1 1 2 3 4 5 6 7 8
2 0 1 0 1 0 1 0 1
3 1 1 1 1 1 1 1 1
;

Then score it

proc score data=values score=seq out=want;
  var score1-score8;
run;

Results:

How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Re: How to score elements of an array specified in another dataset

Registration is open

SAS Training: Just a Click Away