BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MHines
Obsidian | Level 7

Hello,

This is my first time posting -- please excuse me if it is confusing. I am currently reading a text file into SAS. Because the data was character delimited but follows an inconsistent pattern, I originally read the text file into SAS and saved it as a SAS file such that all of the variables and associated data within the delimiters was saved as a single variable (entered here as Var_umbrella). I am now attempting to split the catch-all variable into multiple columns. The original file is delimited with the vertical bar " | ", but the variable names are included inside the delimiter. The complicating factor is that each row/observation does not include all of the variables; if the variable was blank for that row, it was omitted from the data entirely for that row (that is not something I can change, unfortunately. The data was given to me as-is and the source will not make adjustments to it). This is what it looks like:

Case #       Var_umbrella

Row 1        |Var_1 a | Var_2  b | Var_3 c | Var_4 d | Var_5 e |

Row 2        |Var_1 f | Var_4 g |

Row 3        |Var_2 h | Var_3 i | Var_5 j |

Row 4        |Var_5 k|

 

This is what I am trying to get it to look like:

Obs #     Var_1     Var_2     Var_3    Var_4    Var_5

Row 1      a            b             c            d            e

Row 2      f             .               .            g            .

Row 3      .             h              i             .             j

Row 4      .             .               .             .            k

 

I tried using the index and =scan(var,n,'delimiter') functions as follows to create the new columns:

 

data want;

     data have;

     if index(Var_umbrella,'Var_1')>0 then Var_1=scan(Var_umbrella,1,'|');

     if index(Var_umbrella,'Var_2')>0 then Var_2=scan(Var_umbrella,2,'I');

     if index(Var_umbrella,'Var_3')>0 then Var_3=scan(Var_umbrella,3,'I');

     if index(Var_umbrella,'Var_4')>0 then Var_4=scan(Var_umbrella,4,'|');

     if index(Var_umbrella,'Var_5')>0 then Var_5=scan(Var_umbrella,5,'I');

run;

 

This worked perfectly for the first variable but does not read in the following variables correctly because the position of each variable (n) is inconsistent across the rows. Please advise! Thank you for any input, and I am happy to provide additional clarification as necessary.

-Miranda

1 ACCEPTED SOLUTION

Accepted Solutions
MHines
Obsidian | Level 7

Thank you all very much for your feedback! Unfortunately I was not able to get the array recommendations to work, but transposing the data did work. I am not permitted to share the data publicly -- that is why I did not post it, and I absolutely understand that not seeing the data made answering my question much more complicated. This is what the code I ended up using:

 

Var_umbrella = trim(left(substr(Var_umbrella,2)));

seq=0;

 

do while(Var_umbrella ~= " ");
if index(Var_umbrella, "|" )>0 & Var_umbrella ~in("|","||") then do;
piece = trim(left(substr(Var_umbrella,1,index(Var_umbrella,"|")-1)));
Var_umbrella = trim(left(substr(Var_umbrella,index(Var_umbrella,"|")+1)));
seq=seq+1;
output;
end;
else if Var_umbrella in("|","||") then Var_umbrella=" ";
else do;
piece = trim(left(Var_umbrella));
Var_umbrella = " ";
seq=seq+1;
output;
end;
end;
run;

 

data want_1;

set data have;

length vname $50 value $500;

vname = trim(left(substr(piece,1,index(piece," ")-1)));
value = trim(left(substr(piece,index(piece," ")+1)));
run;

 

proc sort data=want_1;
by case;
run;


data want_2;
set want_1;
by case;
retain seq_off;
if first.case=1 then seq_off=1;
else if vname="Var_1" then seq_off = seq_off+1;
run;

 

proc transpose data=want_most out=want_table;
by case;
id vname;
var value;
run;

 

View solution in original post

7 REPLIES 7
Reeza
Super User

Instead of this type of approach, I would recommend using a loop to output each variable to a line and then use PROC TRANSPOSE to rearrange it.

Rough idea, untested code:

 

data long;
set have;

num_terms= countc(var_umbrella, "|");

do i=1 to num_terms;
term = scan(var_umbrella, i, "|");
varName = scan(term, 1);
varValue = scan(term, 2);
output;
end;

keep case varValue VarName term;
run;

proc transpose data=long out=wide;
by case;
id varName;
var varValue;
run;

@MHines wrote:

Hello,

This is my first time posting -- please excuse me if it is confusing. I am currently reading a text file into SAS. Because the data was character delimited but follows an inconsistent pattern, I originally read the text file into SAS and saved it as a SAS file such that all of the variables and associated data within the delimiters was saved as a single variable (entered here as Var_umbrella). I am now attempting to split the catch-all variable into multiple columns. The original file is delimited with the vertical bar " | ", but the variable names are included inside the delimiter. The complicating factor is that each row/observation does not include all of the variables; if the variable was blank for that row, it was omitted from the data entirely for that row (that is not something I can change, unfortunately. The data was given to me as-is and the source will not make adjustments to it). This is what it looks like:

Case #       Var_umbrella

Row 1        |Var_1 a | Var_2  b | Var_3 c | Var_4 d | Var_5 e |

Row 2        |Var_1 f | Var_4 g |

Row 3        |Var_2 h | Var_3 i | Var_5 j |

Row 4        |Var_5 k|

 

This is what I am trying to get it to look like:

Obs #     Var_1     Var_2     Var_3    Var_4    Var_5

Row 1      a            b             c            d            e

Row 2      f             .               .            g            .

Row 3      .             h              i             .             j

Row 4      .             .               .             .            k

 

I tried using the index and =scan(var,n,'delimiter') functions as follows to create the new columns:

 

data want;

     data have;

     if index(Var_umbrella,'Var_1')>0 then Var_1=scan(Var_umbrella,1,'|');

     if index(Var_umbrella,'Var_2')>0 then Var_2=scan(Var_umbrella,2,'I');

     if index(Var_umbrella,'Var_3')>0 then Var_3=scan(Var_umbrella,3,'I');

     if index(Var_umbrella,'Var_4')>0 then Var_4=scan(Var_umbrella,4,'|');

     if index(Var_umbrella,'Var_5')>0 then Var_5=scan(Var_umbrella,5,'I');

run;

 

This worked perfectly for the first variable but does not read in the following variables correctly because the position of each variable (n) is inconsistent across the rows. Please advise! Thank you for any input, and I am happy to provide additional clarification as necessary.

-Miranda


 

ballardw
Super User

See if this helps.

data have;
   informat case $10. var_umbrella $100.;
   infile datalines dlm='#';
   input case Var_umbrella;
datalines;
Row 1#|Var_1 a | Var_2  b | Var_3 c | Var_4 d | Var_5 e |
Row 2#|Var_1 f | Var_4 g |
Row 3#|Var_2 h | Var_3 i | Var_5 j |
Row 4#|Var_5 k|
;

Data temp;
   set have;
   array var_ {5} $ 5 ;

   do i=1 to countc(var_umbrella,'|');
      word = scan(var_umbrella,i,'|');
      if not missing(word) then do;
         number = input(scan(word,2,'_ '),best.);
         var_[number]= scan(word,2,' ');
      end;
   end;
   drop i word number;
run;

The data step is just to have something I can test code with. Something similar is how you should share the current data.

This works with several assumptions:

1) you know how many var_xx variables there should be. The value of XX would go in the { } on the array statement.

2) It assumes that the desired Var_xx are sequential  var_1, Var_2, Var_3 ... Var_10 if there are 10 variables.

3) there will always be a single _ followed by a numeric value in the Var_ names.

4) there is always a space, and only a space, between the Var_xx and the value.

Arrays can hold only one type of variable so I am assuming all are character. If you later need to convert one or more of the Var_xx to numeric that will need to be separate step.

The code pulls out one value between | and parses the first bit for the number after the _, then uses that value to assign the correct numbered array element with the second bit of value

The number after the $ on the Array statement is the length of the variable. If you need longer than 5 characters increase the value.

 

If possible you might consider copying a few lines from the original text file as received and pasting them into a text box opened on the forum with the </> icon.

Sometimes we can spot things that allow reading the data in a nicer fashion. Also when I say as received I am I assuming the file was a plain text file. If this was later read and saved from a program like Excel then all bets are off.

 

One lesser used option is NAMED input which works when the data is presented as

Var_1=a   Var_2=c  Var_5=b

It is possible to create a very long ugly single expression without creating the Word and Number temporary variables but those make it easier to follow the code and by not dropping them you can see how things flow a little better

SASJedi
SAS Super FREQ

I'd use an array:

data have;
   infile datalines truncover;
   input;
   length Var_umbrella $60;
   Var_umbrella=strip(_infile_);
datalines;
|Var_1 a | Var_2  b | Var_3 c | Var_4 d | Var_5 e |
|Var_1 f | Var_4 g |
|Var_2 h | Var_3 i | Var_5 j |
|Var_5 k|
;


data want;
   set have;
   /* Creates variables Var_1-Var_5 as single character */
   array V[5] $1 Var_1-Var_5 ;
   /* Scan first name-value pair*/
   _i=1;
   _t=scan(var_umbrella,_i,'|');
   /* Extract name and value as long as there are more avialable */
   do while (not missing(_t));
     /* Extract the name from this name-value pair*/
      _name=scan(_t,1,' ');
      do _j=1 to dim(v);
        /* Find the array element that matches the name and assign the value */
        if vname(v[_j]) = _name then v[_j]=scan(_t,-1,' ');
      end;
      /* Scan next name-value pair*/
     _i+1;
     _t=scan(var_umbrella,_i,'|');
    end;
    drop _: ;
run;

 This is the result. You can drop the var_umbrella - I just left it for you to see the code worked:

Obs Var_umbrella Var_1 Var_2 Var_3 Var_4 Var_5
1 |Var_1 a | Var_2 b | Var_3 c | Var_4 d | Var_5 e | a b c d e
2 |Var_1 f | Var_4 g | f     g  
3 |Var_2 h | Var_3 i | Var_5 j |   h i   j
4 |Var_5 k|         k

 

In the future, you can use this same technique when reading in the original text file, and produce the result you want directly on the first pass:

 

/* make temporary file with the original text in it for testing */
filename myText temp;
data _null_;
   file myText;
   set have;
   put var_umbrella;
run;

/* Print the text file to the log so you can see it */
data _null_;
   infile myText;
   input ;
   put _infile_;
run;

/*Just read it straight into the correct variables from the input buffer! */
data want1;
   infile myText truncover;
   input;
   array V[5] $1 Var_1-Var_5 ;
   /* Scan first name-value pair*/
   _i=1;
   _t=scan(_infile_,_i,'|');
   /* Extract name and value as long as there are more avialable */
   do while (not missing(_t));
     /* Extract the name from this name-value pair*/
      _name=scan(_t,1,' ');
      do _j=1 to dim(v);
        /* Find the array element that matches the name and assign the value */
        if vname(v[_j]) = _name then v[_j]=scan(_t,-1,' ');
      end;
      /* Scan next name-value pair*/
     _i+1;
     _t=scan(_infile_,_i,'|');
    end;
    drop _: ;
run;
This is the result:

 

Obs Var_1 Var_2 Var_3 Var_4 Var_5
1 a b c d e
2 f     g  
3   h i   j
4         k

 

Check out my Jedi SAS Tricks for SAS Users
MHines
Obsidian | Level 7

Hi! Thank you for the quick response. I stopped at 4 lines of data for illustrative purposes; I have over 11,000,000 rows and most of the variable observations are free text entry, so the data itself is inconsistent on top of the variables being inconsistent. Do you have a recommendation for adapting the array approach to a larger dataset?

Reeza
Super User

The number of observations is irrelevant to @SASJedi's or @ballardw approach, but mine may fail as it will generate a very big data set.

I'm not sure what you mean by free text entry, but the solutions provide work for the use case provided. If they don't work, please update your use case to be more reflective of your actual data. We only have the exact information you post to suggest a solution.

ballardw
Super User

@MHines wrote:

Hi! Thank you for the quick response. I stopped at 4 lines of data for illustrative purposes; I have over 11,000,000 rows and most of the variable observations are free text entry, so the data itself is inconsistent on top of the variables being inconsistent. Do you have a recommendation for adapting the array approach to a larger dataset?


Note that I mentioned my assumptions. If those do not hold, such as the naming/numbering of variables then you need to provide that. I can make other suggestions but without seeing a concrete example then there isn't much point in attempting to address those.

The number records does not matter as @Reeza said, unless, and this a big unless, you have managed to split one row across multiple SAS data set observations. If your data was trully "free text" and might have embedded end of line characters and/or linefeed characters (depends a lot on file source) then that may well be the case and headaches ensue. If that "free text"  contains unquoted | characters that is another issue.

 

The modifications to my approach would likely be the Length of the individual variables and I already mentioned how to do that.

 

Another option I should have mentioned earlier if at all practical: Go the source of the file and have them create something that makes sense, such as an honest to coding character delimited file with column headers where you might have a bunch of ||||| indicating missing data.

 

If this data originated as some sort of Report table text file, then request the data used to create the "table" version.

 

I still think it may be worth pasting a few lines of the actual source file. As I said earlier, there are other ways to read data directly. It just might be in a manner you haven't seen yet. Some of us have dealt with poor "data" files in lots of odd manners and have a number of tricks involved in parsing poor structures. But we need to see some to see what may work.

 

MHines
Obsidian | Level 7

Thank you all very much for your feedback! Unfortunately I was not able to get the array recommendations to work, but transposing the data did work. I am not permitted to share the data publicly -- that is why I did not post it, and I absolutely understand that not seeing the data made answering my question much more complicated. This is what the code I ended up using:

 

Var_umbrella = trim(left(substr(Var_umbrella,2)));

seq=0;

 

do while(Var_umbrella ~= " ");
if index(Var_umbrella, "|" )>0 & Var_umbrella ~in("|","||") then do;
piece = trim(left(substr(Var_umbrella,1,index(Var_umbrella,"|")-1)));
Var_umbrella = trim(left(substr(Var_umbrella,index(Var_umbrella,"|")+1)));
seq=seq+1;
output;
end;
else if Var_umbrella in("|","||") then Var_umbrella=" ";
else do;
piece = trim(left(Var_umbrella));
Var_umbrella = " ";
seq=seq+1;
output;
end;
end;
run;

 

data want_1;

set data have;

length vname $50 value $500;

vname = trim(left(substr(piece,1,index(piece," ")-1)));
value = trim(left(substr(piece,index(piece," ")+1)));
run;

 

proc sort data=want_1;
by case;
run;


data want_2;
set want_1;
by case;
retain seq_off;
if first.case=1 then seq_off=1;
else if vname="Var_1" then seq_off = seq_off+1;
run;

 

proc transpose data=want_most out=want_table;
by case;
id vname;
var value;
run;

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2747 views
  • 5 likes
  • 4 in conversation