Hello all. I have a dataset that has numerous amounts of array data concatenated together by a common string within a cell (string is a ';'). I have developed a process that will loop through each variable , split it out into X new variables, and then tranpose them (maintaining the data order, that step is 100% CRUCIAL)! An example data is given below
Key Var
1 a;A;B;b;
2 c;b;c;b
Desired results
Key Var
1 a
1 A
1 B
1 b
2 c
2 b
2 c
2 b
My process basically looks like the code below, plus some statements to merge the datasets back together. Can anyone tell me if they know a more efficient way of coding this, as the tranpose on a million records (million records after the tranpose) takes about 2-3 seconds, and since I will be doing this hundreds of times for extremely large amounts of data, and then have to set the tables back together, it takes far too long for my desired purposes.
Thank you in advance!
proc contents data=key_vars out=mydata(keep=name varnum);
run;
proc sort data=mydata;
by varnum;
run;
proc sql;
select count(*) into :total_vars
from mydata;
quit;
run;
%do i=1 %to &total_vars;
data _null_;
set mydata;
if _n_=1 then call symputx('variable',name);
run;
%let new_variable_name=new.&variable;
DATA Work.MAXCOUNT (keep=&New_Variable_Name.);
SET MASTER (keep=&Variable);
&New_Variable_Name.=count(&Variable_Name.,';');
RUN;
PROC MEANS data=Work.MAXCOUNT;
var &New_Variable_Name.;
output out=Work.meansdata(drop=_type_ _freq_) max=;
RUN;
DATA Work.meansdata;
SET Work.meansdata;
  if &New_Variable_Name. > 0 then &New_Variable_Name.=&new_variable_name.+1;
RUN;
DATA _null_;
SET meansdata;
  call symput('vararray_length',&new_variable_name);
RUN;
%LET countvar= &vararray_length.;
DATA &drive..&ds (drop=I compress=yes);
SET &drive..MASTER(keep=&variable_name. &keyvars);
ARRAY a{&countvar.} $1000
  a1-a&countvar.;
DO I=1 to &countvar.;
a{I}= strip(SCAN(&Variable_Name,(i),';','m'));
END;
RUN;
proc transpose data=&ds. out=&ds. (rename=(col1=&variable_name. _name_=new) keep=&keyvars _name_ col1
compress=yes );
by &keyvars;
var a1-a&countvar.;
run;
%end;
Thanks in advance!
You can generalize to as many variables as you want by using arrays.
data have;
input key var1 $ var2 $ ;
cards4;
1 a;b;c ;;B
2 b;b;b C;;D
;;;;
data want ;
set have ;
array old var1-var2;
do i=1 to dim(old);
max = max(max,countw(old(i),';','m'));
end;
array new $8 newvar1-newvar2;
do i=1 to max;
do j=1 to dim(old);
new(j) = scan(old(j),i,';','m');
end;
output;
end;
drop j max var1-var2;
rename newvar1-newvar2 = var1-var2 ;
run;
Obs key i var1 var2
1 1 1 a
2 1 2 b
3 1 3 c B
4 2 1 b C
5 2 2 b
6 2 3 b D
I may not understand what you are trying to do. Does the following approximate what you want to do?
data have;
informat var $80.;
input Key Var;
cards4;
1 a;A;B;b;
2 c;b;c;b
;;;;
data want (keep=key var);
format key best12.;
length var $1;
set have (rename=var=var_in);
i=1;
do while (scan(var_in,i,";") ne "");
var=scan(var_in,i,";");
output;
i+1;
end;
run;
What is the question? The program doesn't appear to be transposing the data. Instead it is leaving it horizontal and just splitting the values from the delimited string into multiple variables. So if you had a variable named FRED with value 'a;b;c' you would generate FRED1='a' , FRED2='b' and FRED3='c'. Am I reading the intent properly?
Are you just having trouble determining the maximum number of concatenated values in a variable?
Do you what a coding pattern that will let you split multiple variables at once?
proc sql noprint ;
select max(countw(VAR1,';','m'))
, max(countw(VAR2,';','m'))
into :countvar1
       , :countvar2 
from mydata
;
quit;
data want;
set mydata ;
array var1_ (&countvar1) $10 ;
array var2_ (&countvar2) $10 ;
do _n_ = 1 to &countvar1;
var1_ (_n_) = scan(var1,_n_,';','m');
end;
do _n_ = 1 to &countvar2;
var2_ (_n_) = scan(var2,_n_,';','m');
end;
   drop var1 var2 ;
run;
I'm sorry but the code sample I gave is not a full sample but a generic outline of what I am going, It is a very generic example. Tom, you have step 1 correct in what i'm trying to do. . First I split the variable (call it var 1) into X different variables (a1,a2,a3, etc...) then the transpose step transposes these by the key variables that I define.
Aka
Key Var
1 a;b;c
2 b;b;b;
Becomes
key var a1 a2 a3
1 a;b;c; a b c
2 b;b;b; b b b
Then when the tranpose happens it become the dataset that I showed in my example. I might be able to alter the full code to give the full example with some more time tomorrow.
I then strip the tranposed column of the A's (since the varialbes become a1,a2,a3 etc..) and then order by the the remaining numbers, that way data order is maintained.
The array process makes new varialbes and the tranpose step tranposes then.
ARTHUR, this is exactly what I need the process to do, however there are a few minor issues. In my data, it is possible to have valid null data in the strings. Example, the string a;;b;;d;;d;e;d would be a valid string, and we would need the second row of data to be a true null. I also need it to run dynamically for any number of columns (can be different number of columns with different names). Is this something that you can adjust your code to do? (My original code looked similar to yours however I could not figure out how to deal with the null isue). Thanks again everyone! All advice here is extremely helpful!
 
I am still at a loss to understand what the issue is. If you just want to rotate the concatenated values into observations then a simple DO loop with an embedded OUTPUT statement will do that. You should remember to keep the loop counter so that you can preserve the order.
Was there a question in there about how to deal with more than one of these concatenated variables in a single pass through the data? If so can you give an example of the output you would like?
data want ;
set have ;
do i=1 to countw(var,';','m');
newvar=scan(var,i,';','m');
output;
end;
run;
data _null_;
set want ;
put key i newvar var ;
run;
1 1 a a;b;c
1 2 b a;b;c
1 3 c a;b;c
2 1 b b;b;b;
2 2 b b;b;b;
2 3 b b;b;b;
2 4 b;b;b;
HELLO Tom. You are correct the question is how to put the data going the long direction instead of the wide, however doing it with multiple variables. For example, the following dataset below would below the after dataset if working correctly.
Key Var1 Var2
1 a;b;c ;;B
2 b;b;b C;;D
Becomes
key var1 Var2
1 a NULL
1 b NULL
1 c B
2 b C
2 b NULL
2 b D
Note that I wrote the word NULL instead of leaving it blank for illustration purposes. Basically I dont know how to loop through each variable in a dataset when you can have 100 different columns you wish to loop through in one data step. It also has to be dynamic, as the column names and numbers can change (and do) daily.
I could theoritically split it up by unique variable and then merge (which is similar to what I'm doing now), but i'm looking for the ability to do it all in one dataset to maximze efficiency. [To be honest the output method might still be more efficient than the method I have been using, I will perform some tests on this tomorrow.]
Thanks again
Brandon
So just adjust the upper limit of your loop counter to account for maximum over all of the variables that you are converting.
data want ;
set have ;
do i=1 to max(countw(var1,';','m'),countw(var1,';','m'));
newvar1=scan(var1,i,';','m');
newvar2=scan(var2,i,';','m');
output;
end;
run;
You can generalize to as many variables as you want by using arrays.
data have;
input key var1 $ var2 $ ;
cards4;
1 a;b;c ;;B
2 b;b;b C;;D
;;;;
data want ;
set have ;
array old var1-var2;
do i=1 to dim(old);
max = max(max,countw(old(i),';','m'));
end;
array new $8 newvar1-newvar2;
do i=1 to max;
do j=1 to dim(old);
new(j) = scan(old(j),i,';','m');
end;
output;
end;
drop j max var1-var2;
rename newvar1-newvar2 = var1-var2 ;
run;
Obs key i var1 var2
1 1 1 a
2 1 2 b
3 1 3 c B
4 2 1 b C
5 2 2 b
6 2 3 b D
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
