BookmarkSubscribeRSS Feed
lifelearner2015
Fluorite | Level 6

Dear all,

I know how to split text column (DIAGNOSIS) into separate columns (i.e., col1--col4), shown below in SAS output:

(code in DATA step)

DIAGNOSIS= TRIM(DIAGNOSIS);
col1=scan(DIAGNOSIS, 1, '/' || ',');
col2=scan(DIAGNOSIS, 2, '/' || ',');
col3=scan(DIAGNOSIS, 3, '/' || ',');
col4=scan(DIAGNOSIS, 4, '/' || ',');
col5=scan(DIAGNOSIS, 5, '/' || ',');
run;

 

sas output.PNG

but this is not preferrable. I want the output being like the below, each column having same-type value.

(The following is preferred SAS output)

want output.PNG

 

So how can I do it in SAS? Thanks a lot. 

-b

 

10 REPLIES 10
Reeza
Super User

How many possible different diagnosis can you have? I can see that getting wide very fast.

 

Easies method in SAS is to transpose to a long format and then back to a wide format. 

Transpose using an output statement - add output; after each col calculation you've shown. 

lifelearner2015
Fluorite | Level 6

hi, Reeza,

Thank you for your idea. It's about 8 diagnoses.  

Reeza
Super User

Post some sample data, in an easy to import format (datalines preferable) and someone can demonstrate the code if you need help. 

 

Transpose is the most dynamic solution but probably more steps than a manual solution. If you only have 8, an array with a lookup is also an option, but if you expand the number it can get ugly fast. Plus with the transpose you don't have to worry about the amount or names, it updates automatically. 

lifelearner2015
Fluorite | Level 6

Hi, Reeza, 

The simplified original data is like this:

 

data diag;

input ID$ AGE DIAGNOSIS$;

DATALINES;

1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack

2 33 lung problem/depression/eating disorder/anxiety/seizure

3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder

;

run;

 

ScottBass
Rhodochrosite | Level 12

There are multiple ways of doing this, but the main issues are this:

 

* Your diagnoses are formatted in one long string, with a delimiter separating each diagnosis

(Next time get your data manager to format the data as one row per diagnosis, IMO that would be a better data model)

* You need to get a distinct list of diagnoses across your entire dataset

* The count of distinct diagnoses determines the number of columns to create in your output dataset

* You would like the diagnosis values to "line up" in each column, in alphabetical order

* You want the code to "just work" if the number of distinct diagnoses changes in the future (or you want to apply this same approach to a different study).

 

Here are a few approaches, pick your favourite, apologies for the length and repeated code, it's just a cut-and-paste from my EG session...

 

Double array approach: 

 

data have;
input ID AGE DIAGNOSIS $char100.;
DATALINES;
1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack
2 33 lung problem/depression/eating disorder/anxiety/seizure
3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder
;
run;

* data step transpose of diagnosis ;
data diagnoses;
   set have;
   length diag $100;
   do i=1 to countw(diagnosis,"/");
      diag=scan(diagnosis,i,"/");
      output;
   end;
   drop diagnosis i;
run;

* get distinct list of diagnoses ;
proc sort data=diagnoses (keep=diag) out=diags nodupkey;
   by diag;
run;

* get count of diagnoses ;
data _null_;
   set diags nobs=nobs;
   call symputx("nobs",nobs,"G");
   stop;
run;

* use two arrays ;
data want;
   * load an array of the distinct list of diagnoses ;
   array diags[&nobs] $100 _temporary_;
   do until (eof1);
      set diags end=eof1;
      diags[_n_]=diag;
      _n_+1;
   end;

   * now process the main dataset ;
   do until (eof2);
      set diagnoses end=eof2;
      by id age;
      length diagnosis1-diagnosis&nobs $100;
      array diagnosis[&nobs] diagnosis:;
      if first.id then call missing(of diagnosis:);
      do _n_=1 to dim(diags);
         if diag=diags[_n_] then diagnosis[_n_]=diag;
      end;
      if last.id then output;
   end;
   keep id age diagnosis:;
run;

 

Hash object approach:

 

data have;
input ID AGE DIAGNOSIS $char100.;
DATALINES;
1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack
2 33 lung problem/depression/eating disorder/anxiety/seizure
3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder
;
run;

* data step transpose of diagnosis ;
data diagnoses;
   set have;
   length diag $100;
   do i=1 to countw(diagnosis,"/");
      diag=scan(diagnosis,i,"/");
      output;
   end;
   drop diagnosis i;
run;

* get distinct list of diagnoses ;
proc sort data=diagnoses (keep=diag) out=diags nodupkey;
   by diag;
run;

* store the index with the list ;
data diags;
   set diags end=eof;
   index+1;
   if eof then call symputx("nobs",index,"G");
run;

* use a hash object ;
data want;
   if 0 then set diags;  * sets column attributes ;
   if (_n_=1) then do;
      declare hash h (dataset:"diags");
      h.defineKey("diag");
      h.defineData("index");
      h.defineDone();
   end;
   call missing(of _all_);  * implied retain on "if 0 then set diags" statement ;

   do until (last.id);
      set diagnoses; * this will read in id age diag ;
      by id age;

      * declare the diagnosis columns ;
      length diagnosis1-diagnosis&nobs $100;
      array diagnosis[*] diagnosis:;

      * find the matching diagnosis in the hash ;
      * this will set the array pointer value ;
      h.find();

      * now assign the value to the correct array element ;
      diagnosis[index]=diag;
   end;
   output;
   keep id age diagnosis:;
run;

 

Interleave data approach:

 

data have;
input ID AGE DIAGNOSIS $char100.;
DATALINES;
1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack
2 33 lung problem/depression/eating disorder/anxiety/seizure
3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder
;
run;

* data step transpose of diagnosis ;
data diagnoses;
   set have;
   length diag $100;
   do i=1 to countw(diagnosis,"/");
      diag=scan(diagnosis,i,"/");
      output;
   end;
   drop diagnosis i;
run;

* get distinct list of diagnoses ;
proc sort data=diagnoses (keep=diag) out=diags nodupkey;
   by diag;
run;

* get count ;
data _null_;
   set diags nobs=nobs;
   call symputx("nobs",nobs,"G");
   stop;
run;

* transpose the list ;
proc transpose data=diags out=diags_trans (keep=col:);
   var diag;
run;

* join with the diagnoses dataset and derive results ;
data want;
   format id age;  * set partial PDV order ;
   if (_n_=1) then set diags_trans;
   do until (last.id);
      set diagnoses;
      by id age;
      length diagnosis1-diagnosis&nobs $100;
      array diagnosis[*] diagnosis:;
      array col[*] col:;
      do i=1 to dim(col);
         if diag=col[i] then do;
            diagnosis[i]=diag;
            leave;
         end;
      end;
   end;
   output;
   keep id age diagnosis:;
run;

 

Cartesian product approach:

 

data have;
input ID AGE DIAGNOSIS $char100.;
DATALINES;
1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack
2 33 lung problem/depression/eating disorder/anxiety/seizure
3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder
;
run;

* data step transpose of diagnosis ;
data diagnoses;
   set have;
   length diag $100;
   do i=1 to countw(diagnosis,"/");
      diag=scan(diagnosis,i,"/");
      output;
   end;
   drop diagnosis i;
run;

* get distinct list of diagnoses ;
proc sort data=diagnoses (keep=diag) out=diags nodupkey;
   by diag;
run;

* store the index with the list ;
data diags;
   set diags end=eof;
   index+1;
   if eof then call symputx("nobs",index,"G");
run;

* create the Cartesian product of the diagnoses, which will set the column index ;
* Note: this could perform badly if the data is really large ;
proc sql;
   create table temp as
   select a.id, a.age,a.diag,b.index
   from diagnoses a
   full join
   diags b
   on a.diag=b.diag
   order by id, age;
quit;

* now use the index to line up the columns ;
data want;
   do until (last.id);
      set temp;
      by id age;
      length diagnosis1-diagnosis&nobs $100;
      array diagnosis[*] diagnosis:;
      diagnosis[index]=diag;
   end;
   output;
   keep id age diagnosis:;
run;
  

 


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Shmuel
Garnet | Level 18

You can create format of poosible diagnosys in desired order, like next code.

Be careful witt case and typos if need.

proc format lib=work;
value $diag
'eating disorder' = '01'
'sleeping disorder' = '02'
'seizure' = '03'
'lung problem' = '05'
'depression' = '06'
'panic attack' = '07'
'anxiety' = '08'
; /* add more lines as mutch as need in desired order */
run;

data have;
input ID AGE DIAGNOSIS $char100.;
DATALINES;
1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack
2 33 lung problem/depression/eating disorder/anxiety/seizure
3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder
;
run;

data want;
set have;
length diag diag1-diag8 $20;
array di $ diag1-diag8;
n = countw(DIAGNOSIS,'/,');
do i=1 to n;
diag = scan(DIAGNOSIS,i,'/,');
j=input(put(diag,$diag2.),2.);
di(j) = diag;
end;
drop i j n diag;
run;

You may get array subscript error in case of diagnosys not defined in the format.

Just add it to the format, addapt array length and rerun.

ScottBass
Rhodochrosite | Level 12

@Shmuel, these are great ideas Smiley Happy!  

 

Using a format to derive the column pointer, plus parsing the original table in place, results in tighter and easier to read code.

 

I would make these minor changes:

 

Since a format:

* always returns character output

* can accept either a numeric or character input

 

and an informat:

* always takes character input

* can return either a numeric or character output

 

We can use a numeric informat instead of a character format.  This simplifies the code to a single call to input instead of the double input(put(...)) construct.

 

I'd make this data driven instead of hard coded, in case the list of diagnoses changes in the future, or the code is to be reused in another study.

 

Here is my latest iteration...thanks for the great ideas!

 

data have;
input ID AGE DIAGNOSIS $char100.;
DATALINES;
1 22 eating disorder/sleeping disorder/seizure/lung problem/depression/panic attack
2 33 lung problem/depression/eating disorder/anxiety/seizure
3 44 sleeping disorder/panic attack/anxiety/lung problem/eating disorder
;
run;

* data step transpose of diagnosis ;
* a view may perform better if the source file is large ;
* since disk I/O is reduced ;
data diagnoses / view=diagnoses;
   set have (rename=(diagnosis=temp));
   length diagnosis $100;
   do i=1 to countw(temp,"/");
      diagnosis=scan(temp,i,"/");
      output;
   end;
   keep diagnosis;
run;

* get distinct list of diagnoses ;
proc sort data=diagnoses out=diagnosis nodupkey;
   by diagnosis;
run;

* create a cntlin dataset for proc format ;
* we want to create a numeric informat ;
data cntlin;
   format fmtname type start end label;  * PDV order ;
   keep fmtname type start end label;
   set diagnosis end=eof;
   fmtname='diagnosis';
   type='I';
   start=diagnosis;
   end=start;
   label+1;
   if eof then call symputx("n",label,"G");
run;

* create informat ;
proc format cntlin=cntlin;
run;

* create the desired output ;
data want;
   set have (rename=(diagnosis=temp));
   length diagnosis diagnosis1-diagnosis&n $100;
   array arrdiags[*] diagnosis1-diagnosis&n;
   do i=1 to countw(temp,"/");
      diagnosis=scan(temp,i,"/");
      arrdiags[input(diagnosis,diagnosis.)]=diagnosis;
   end;
   keep id age diagnosis1-diagnosis&n;
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Shmuel
Garnet | Level 18

@ScottBass, I like the idea of informat instead of the format.

I have thought of using cntlout in oreder to create the full list and count them,

but there may be some isuues:

- case letters. Diagnoses shold be all either lowcase or uppercase

- typos - should be handeled manually to avoid semi-duplicates/splitting.

- as mutch as I know the START, END variables in a format are limited length. Is it still 16 Characters ?
  how to deal with diagnosis that are longer ?

 

@lifelearner2015, check, do you have any of above issues ? 

- case is the easiest to deal by using sas function lowcase() or upcase() - whatever you preffer.

- you won't find typos if diagnoses are selected from a fixed list and not by typing.
  you can look for typos in the sorted cntlout dataset.

- is there any diagnosys length greater than 16 characters ? 

ScottBass
Rhodochrosite | Level 12

@Shmuel wrote:

@ScottBass, I like the idea of informat instead of the format.

I have thought of using cntlout in oreder to create the full list and count them,

but there may be some isuues:

- case letters. Diagnoses shold be all either lowcase or uppercase

 

SB: The input function does a case-sensitive comparision to the format value. If a case-insenstive comparison is desired, the UPCASE informat option could be used. See http://support.sas.com/documentation/cdl/en/proc/69850/HTML/default/viewer.htm#p1pmw90bl3jzgdn1w4202.... I haven't investigated how to specify UPCASE in an cntlin dataset; reverse engineering a cntlout dataset would probably show how this is done (or the doc).

 

- typos - should be handeled manually to avoid semi-duplicates/splitting.

 

SB: I'm assuming the source data is clean. If there are typos then the typos will be matched, since the informat is exactly derived from the source data. Otherwise @lifelearner2015 will need to clean his data first before creating the informat.

 

- as mutch as I know the START, END variables in a format are limited length. Is it still 16 Characters ?
  how to deal with diagnosis that are longer ?

 

SB: See test code below.
 

@lifelearner2015, check, do you have any of above issues ? 

- case is the easiest to deal by using sas function lowcase() or upcase() - whatever you preffer.

- you won't find typos if diagnoses are selected from a fixed list and not by typing.
  you can look for typos in the sorted cntlout dataset.

- is there any diagnosys length greater than 16 characters ? 


 

I ran this test code.  It seemed to work for me.  @lifelearner2015, use which ever approach you prefer, or try them all as a learning exercise.  Post any further issues as a follow up.

 

data test;
   length long $1000;
   char="ABCDEFGHIJKLmnopqrstuvwxyz";
   long=repeat(char,100);
run;
data cntlin;
   set test (keep=long);
   start=long;
   end=start;
   label=1;
   fmtname="test";
   type="I";
run;
proc format cntlin=cntlin;
run;
data test2;
   set test;
   test1=input(long,test.);
   test2=input(substr(long,1,999),test.);  * does not match, so char[1000] is signficant ;
   test3=input(upcase(long),test.);  * does not match, so case is significant ;
run;

Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.
Shmuel
Garnet | Level 18

I have found some options running proc format.

Next was copied from an O/L documentation:

 

Syntax

PROC FORMAT <option(s)>;

 

Relevant options to issues:

MAXLABLEN=number-of-characters

specifies the number of characters in the informatted or formatted value that you want to appear in the CNTLOUT= data set or in the output of the FMTLIB option. The FMTLIB option prints a maximum of 40 characters for the informatted or formatted value.

 

MAXSELEN=number-of-characters

specifies the number of characters in the start and end values that you want to appear in the CNTLOUT= data set or in the output of the FMTLIB option. The FMTLIB option prints a maximum of 16 characters for start and end values.

 

FMTLIB

prints information about all the informats and formats in the catalog that is specified in the LIBRARY= option. To get information only about specific informats or formats, subset the catalog using the SELECT or EXCLUDE statement.

Interaction: The PAGE option invokes FMTLIB.
Tip: If your output from FMTLIB is not formatted correctly, then try increasing the value of the LINESIZE= system option.
Tip: If you use the SELECT or EXCLUDE statement and omit the FMTLIB and CNTLOUT= options, then the procedure invokes the FMTLIB option and you receive FMTLIB option output.

 

 

Relating to CASE issue - next is proposed code copied from @ScottBass example:

 

data cntlin;
   set test (keep=long);
   start=upcase(long);    /* <<<< upcase function added */
   end=start;
   label=1;
   fmtname="test";
   type="I";
run;

proc format cntlin=cntlin; run; data test2; set test;
long = upcase(long); /* <<< line added to enable upper case comparison */ test1=input(long,test.); test2=input(substr(long,1,999),test.); * does not match, so char[1000] is signficant ; test3=input(upcase(long),test.); * does not match, so case is significant ; run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 2458 views
  • 3 likes
  • 4 in conversation