BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
DeepakSwain
Pyrite | Level 9

Hi there,

I have a dataset in which each row is having multiple sections. I want to create separate columns for each section. As the delimiters are varying in different rows such as 1), 2) , 3) , 1. 2. 3.  A) , B) , C) etc., I am struggling to separate it. Can somebody help me to solve the issue. Thank you in advance for your kind reply. Sas code is given below to give an idea of the existence of records: 

DATA TEST;
REPORT= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
A) ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
B) ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
A. POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
B. POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;
Swain
1 ACCEPTED SOLUTION

Accepted Solutions
Ksharp
Super User

OK. Add one more statement ( n=0; ) into it.

 

 

proc import datafile='/folders/myfolders/test1.xlsx' out=test dbms=xlsx replace;
run;
data temp;
 set test;
 n=0;
 do i=1 to countw(report,' ');
  temp=scan(report,i,' ');
  if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
  output;
 end;
 keep patient_id n temp;
run;
data want;
 set temp;
 by patient_id  n;
 length want $ 32767;
 retain want;
 want=catx(' ',want,temp);
 if last.n then do;output;call missing(want);end;
 drop temp;
 run;

View solution in original post

14 REPLIES 14
Reeza
Super User

Are you importing this from a text file? Can you correct it in the import step?

DeepakSwain
Pyrite | Level 9

Hi Reeza,

 

Sorry for the inconvenience. What you are seeing is just a single record having 8 lines. 

 

Regards,

Deepak

Swain
Reeza
Super User

I understand that. My question is how did you get the data in the first place. Sometimes its easier to manipulate and make changes in the original data import step rather than afterwards.

 

How do you want your output to look like?

Do all your questions have the exact same structure or do they vary, number of sections/delimiters.

Do any responses have character ")" or can that be used as a delimiter. 

 

Take a look at the SCAN(), SUBSTR() functions for a start.

DeepakSwain
Pyrite | Level 9

Hi reeza,

I want each section of my report in a separate column. For e.g. in this record , there are 8 lines, so I want each line in a separate column. The character ")"  can that be used as a delimiter. The number of sections/delimiters may vary but I have different types of variation, I came across such as 1) 2) 3) 1. 2. 3. a. b. c. a) b) c). 

sorry to provide you a messy record on Friday. 

 

Regards,

Deepak

Swain
ballardw
Super User

Your posted data contains linefeed or carriage returns between what I believe you want for your new variables. Does your actual data contain those characters? That would be the easiest thing to split on if so.

DeepakSwain
Pyrite | Level 9

Hi Ballardw, 

 

Thank you for your kind reply. Actually this is the real data what you are seeing. If you want to manipulate it to facilitate analysis, please go ahead. I will follow your path. My target is to get separate reports which are existing in a single report in merged form. 

 

Regards,

Deepak

 

Swain
ballardw
Super User

There is an issue that sometimes what is posted is not actually the content of the data after being filtered through the forum paste and copy to my side. I see a linebreak but that could be caused by more than one character and depending on the original file source could be a carriage return/line feed pair of characters, a single line feed, a single carriage return, and possibly even a vertical tab. But when pasting into the SAS editor it "fixes" things so that those characters are treated differently. So testing code is difficult.

 

I'm sure there is some slick way with regular expressions to do this but I don't work with them often.

This MIGHT help but this is untested because of the issue above.

DATA TEST;
   set have; /* your dataset*/
/* assumes there are never more than 10 substatements increase the 10 if that is not the case. The substatments have to be long enough to the 
the longest substatement, the 700 should be as long as the length of the REPORT variable*/
   length tstr s1-s10 $ 700;   
   tstr = report; /* going to be slicing up things, don't want to lose orgininal data*/
   array s $ s1-s10;/* the 10 matches the 10 in the Length*/
   i = 1;
   do while (tstr ne '');
      pos = anycntrl(tstr); /*look for control characters*/
      /* when none then there should not be a line break*/
      if pos  = 0 then do ;
         s[i]=tstr;
         tstr='';
      end;
      /* when a linebreak is found, keep the string up to that position
      and remove that portion from the temporary string*/
      else do;
         s[i]= substr(tstr,1,pos-1);
         tstr= substr(tstr,pos+1);
      end;
      i=i+1;
   end;
   drop pos i;
run;
DeepakSwain
Pyrite | Level 9

Hi Ballardw,

Everytime I go through your reply, I learn something new. I get a new direction to think about my issue. Thank you for your incredible replies. 

Using the code provided, I generated the output and attached for your kind consideration to get future direction as I am interested to address this issue using method other than regular expression.

 

Thank you in advance for your kind reply.

 

Regards,

Deepak 

Swain
Ksharp
Super User
That is really not easy. There are too many patterns you need to consider about .



DATA TEST;
REPORT= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 A) ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
 B) ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 
 A. POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
 B. POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;
data temp;
 set test;
 do i=1 to countw(report,' ');
  temp=scan(report,i,' ');
  if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
  output;
 end;
 keep n temp;
run;
data want;
 set temp;
 by n;
 length want $ 32767;
 retain want;
 want=catx(' ',want,temp);
 if last.n then do;output;call missing(want);end;
 drop temp;
 run;

DeepakSwain
Pyrite | Level 9

Hi Ksharp,

 

Kindly accept my apology for the delayed reply. The solution provided by you is simply great ! The different sections present in my report can be separated easily into multiple rows.  Is it possible to shift different section to new coulmns from the begining. (inspite of using proc transpose after the code provided by you. I am just curious; otherwise the solution provided for this critical situation is awesome. 

 

Regards,

Deepak

Swain
DeepakSwain
Pyrite | Level 9

Hi Ksharp,

 

First of all, I want to thank you for providing me a great solution for a very critical issue related to my dataset having multiple reports merged together into a single report. 

 

I am trying to use the code provided by you to my dataset having multiple patients. I am experiencing difficulty in splitting the reports per patient. I have attached a sample for your kind consideration. Can you kindly guide me further.

 

Thank you in adavance for your kind guidance. 

 

Regards,

Deepak

Swain
Ksharp
Super User

OK. no problem. Just add "patient_id" into group variables.

 

proc import datafile='/folders/myfolders/test1.xlsx' out=test dbms=xlsx replace;
run;
data temp;
 set test;
 do i=1 to countw(report,' ');
  temp=scan(report,i,' ');
  if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
  output;
 end;
 keep patient_id n temp;
run;
data want;
 set temp;
 by patient_id  n;
 length want $ 32767;
 retain want;
 want=catx(' ',want,temp);
 if last.n then do;output;call missing(want);end;
 drop temp;
 run;
DeepakSwain
Pyrite | Level 9

Good morning Ksharp,

 

Thank you for your incredible reply. For your kind information, I am trying to identify the number of sections existing in each report per person. Using the existing code, I am getting a number which reflects the overall count of sections. For your kind understanding, I have attached an excel sheet reflecting my needs. This is needed to get an idea about the number of reports belonging to each patient as well as different id can be allotted by concatenating patient id with the serial number. 

 

Once again thank you in advance with an expectation to get another invaluable tips from you which will be a life time experience as usual.

 

Regards,

Deepak

 

Swain
Ksharp
Super User

OK. Add one more statement ( n=0; ) into it.

 

 

proc import datafile='/folders/myfolders/test1.xlsx' out=test dbms=xlsx replace;
run;
data temp;
 set test;
 n=0;
 do i=1 to countw(report,' ');
  temp=scan(report,i,' ');
  if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
  output;
 end;
 keep patient_id n temp;
run;
data want;
 set temp;
 by patient_id  n;
 length want $ 32767;
 retain want;
 want=catx(' ',want,temp);
 if last.n then do;output;call missing(want);end;
 drop temp;
 run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 5443 views
  • 6 likes
  • 4 in conversation