BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anotherdream
Quartz | Level 8

Hello all. I have a dataset that has numerous amounts of array data concatenated together by a common string within a cell (string is a ';').  I have developed a process that will loop through each variable , split it out into X new variables, and then tranpose them (maintaining the data order, that step is 100% CRUCIAL)! An example data is given below

Key      Var

1          a;A;B;b;

2          c;b;c;b

Desired results

Key     Var

1          a

1          A

1          B

1          b

2          c

2          b

2          c

2          b

My process basically looks like the code below, plus some statements to merge the datasets back together. Can anyone tell me if they know a more efficient way of coding this, as the tranpose on a million records (million records after the tranpose) takes about 2-3 seconds, and since I will be doing this hundreds of times for extremely large amounts of data, and then have to set the tables back together, it takes far too long for my desired purposes.

Thank you in advance!

proc contents data=key_vars out=mydata(keep=name varnum);

run;

proc sort data=mydata;

by varnum;

run;

proc sql;

select count(*) into :total_vars

from mydata;

quit;

run;

%do i=1 %to &total_vars;

data _null_;

set mydata;

if _n_=1 then call symputx('variable',name);

run;

%let new_variable_name=new.&variable;


DATA Work.MAXCOUNT (keep=&New_Variable_Name.);
SET MASTER (keep=&Variable);
&New_Variable_Name.=count(&Variable_Name.,';');
RUN;


PROC MEANS data=Work.MAXCOUNT;
var &New_Variable_Name.;
output out=Work.meansdata(drop=_type_ _freq_) max=;
RUN;

DATA Work.meansdata;
SET Work.meansdata;
  if &New_Variable_Name. > 0 then &New_Variable_Name.=&new_variable_name.+1;
RUN;

DATA _null_;
SET meansdata;
  call symput('vararray_length',&new_variable_name);
RUN;

%LET countvar= &vararray_length.;
DATA &drive..&ds (drop=I compress=yes);
SET &drive..MASTER(keep=&variable_name. &keyvars);
ARRAY a{&countvar.} $1000
  a1-a&countvar.;
DO I=1 to &countvar.;
a{I}= strip(SCAN(&Variable_Name,(i),';','m'));
END;
RUN;

proc transpose data=&ds. out=&ds. (rename=(col1=&variable_name. _name_=new) keep=&keyvars _name_ col1
compress=yes );
by &keyvars;
var a1-a&countvar.;
run;

%end;

Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You can generalize to as many variables as you want by using arrays.

data have;

  input key var1 $ var2 $ ;

cards4;

1 a;b;c ;;B

2 b;b;b C;;D

;;;;

data want ;

  set have ;

  array old var1-var2;

  do i=1 to dim(old);

    max = max(max,countw(old(i),';','m'));

  end;

  array new $8 newvar1-newvar2;

  do i=1 to max;

    do j=1 to dim(old);

      new(j) = scan(old(j),i,';','m');

    end;

    output;

  end;

  drop j max var1-var2;

  rename newvar1-newvar2 = var1-var2 ;

run;

Obs    key    i    var1    var2

1      1     1     a

2      1     2     b

3      1     3     c       B

4      2     1     b       C

5      2     2     b

6      2     3     b       D

View solution in original post

7 REPLIES 7
art297
Opal | Level 21

I may not understand what you are trying to do.  Does the following approximate what you want to do?

data have;

  informat var $80.;

  input Key Var;

  cards4;

1          a;A;B;b;

2          c;b;c;b

;;;;

data want (keep=key var);

  format key best12.;

  length var $1;

  set have (rename=var=var_in);

  i=1;

  do while (scan(var_in,i,";") ne "");

    var=scan(var_in,i,";");

    output;

    i+1;

  end;

run;

Tom
Super User Tom
Super User

What is the question?  The program doesn't appear to be transposing the data.  Instead it is leaving it horizontal and just splitting the values from the delimited string into multiple variables.  So if you had a variable named FRED with value 'a;b;c' you would generate FRED1='a' , FRED2='b' and FRED3='c'.   Am I reading the intent properly?

Are you just having trouble determining the maximum number of concatenated values in a variable?

Do you what a coding pattern that will let you split multiple variables at once?


proc sql noprint ;

  select max(countw(VAR1,';','m'))

       , max(countw(VAR2,';','m'))

    into :countvar1

       , :countvar2

    from mydata

  ;

quit;


data want;

   set mydata ;

   array var1_ (&countvar1) $10 ;

   array var2_ (&countvar2) $10 ;

   do _n_ = 1 to &countvar1;

     var1_ (_n_) = scan(var1,_n_,';','m');

   end;

   do _n_ = 1 to &countvar2;

     var2_ (_n_) = scan(var2,_n_,';','m');

   end;

   drop var1 var2 ;

run;

Anotherdream
Quartz | Level 8

I'm sorry but the code sample I gave is not a full sample but a generic outline of what I am going, It is a very generic example. Tom, you have step 1 correct in what i'm trying to do. . First I split the variable (call it var 1) into X different variables (a1,a2,a3, etc...) then the transpose step transposes these by the key variables that I define.

Aka

Key     Var

1          a;b;c

2           b;b;b;

Becomes

key           var          a1     a2     a3

1             a;b;c;         a     b       c

2             b;b;b;         b     b       b

Then when the tranpose happens it become the dataset that I showed in my example. I might be able to alter the full code to give the full example with some more time tomorrow.

         

I then strip the tranposed column of the A's (since the varialbes become a1,a2,a3 etc..) and then order by the the remaining numbers, that way data order is maintained.

The array process makes new varialbes and the tranpose step tranposes then.

ARTHUR, this is exactly what I need the process to do, however there are a few minor issues. In my data, it is possible to have valid null data in the strings. Example, the string a;;b;;d;;d;e;d would be a valid string, and we would need the second row of data to be a true null. I also  need it to run dynamically for any number of columns (can be different number of columns with different names).  Is this something that you can adjust your code to do? (My original code looked similar to yours however I could not figure out how to deal with the null isue). Thanks again everyone! All advice here is extremely helpful!


Tom
Super User Tom
Super User

I am still at a loss to understand what the issue is.  If you just want to rotate the concatenated values into observations then a simple DO loop with an embedded OUTPUT statement will do that.  You should remember to keep the loop counter so that you can preserve the order.

Was there a question in there about how to deal with more than one of these concatenated variables in a single pass through the data?  If so can you give an example of the output you would like?

data want ;

  set have ;

  do i=1 to countw(var,';','m');

    newvar=scan(var,i,';','m');

    output;

  end;

run;

data _null_;

  set want ;

  put key i newvar var ;

run;


1 1 a a;b;c

1 2 b a;b;c

1 3 c a;b;c

2 1 b b;b;b;

2 2 b b;b;b;

2 3 b b;b;b;

2 4   b;b;b;

Anotherdream
Quartz | Level 8

HELLO Tom. You are correct the question is how to put the data going the long direction instead of the wide, however doing it with multiple variables. For example, the following dataset below would below the after dataset if working correctly.

Key     Var1               Var2

1          a;b;c               ;;B

2           b;b;b              C;;D

Becomes

key           var1      Var2

1               a          NULL

1               b          NULL

1               c          B

2               b          C

2               b          NULL

2               b          D

Note that I wrote the word NULL instead of leaving it blank for illustration purposes. Basically I dont know how to loop through each variable in a dataset when you can have 100 different columns you wish to loop through in one data step. It also has to be dynamic, as the column names and numbers can change (and do) daily.

I could theoritically split it up by unique variable and then merge (which is similar to what I'm doing now), but i'm looking for the ability to do it all in one dataset to maximze efficiency.  [To be honest the output method might still be more efficient than the method I have been using, I will perform some tests on this tomorrow.]

Thanks again

Brandon

Tom
Super User Tom
Super User

So just adjust the upper limit of your loop counter to account for maximum over all of the variables that you are converting.

data want ;

  set have ;

  do i=1 to max(countw(var1,';','m'),countw(var1,';','m'));

    newvar1=scan(var1,i,';','m');

    newvar2=scan(var2,i,';','m');

    output;

  end;

run;

Tom
Super User Tom
Super User

You can generalize to as many variables as you want by using arrays.

data have;

  input key var1 $ var2 $ ;

cards4;

1 a;b;c ;;B

2 b;b;b C;;D

;;;;

data want ;

  set have ;

  array old var1-var2;

  do i=1 to dim(old);

    max = max(max,countw(old(i),';','m'));

  end;

  array new $8 newvar1-newvar2;

  do i=1 to max;

    do j=1 to dim(old);

      new(j) = scan(old(j),i,';','m');

    end;

    output;

  end;

  drop j max var1-var2;

  rename newvar1-newvar2 = var1-var2 ;

run;

Obs    key    i    var1    var2

1      1     1     a

2      1     2     b

3      1     3     c       B

4      2     1     b       C

5      2     2     b

6      2     3     b       D

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 738 views
  • 0 likes
  • 3 in conversation