Hi there,
I have a dataset in which each row is having multiple sections. I want to create separate columns for each section. As the delimiters are varying in different rows such as 1), 2) , 3) , 1. 2. 3. A) , B) , C) etc., I am struggling to separate it. Can somebody help me to solve the issue. Thank you in advance for your kind reply. Sas code is given below to give an idea of the existence of records:
DATA TEST;
REPORT= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
A) ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED.
B) ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS.
A. POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND.
B. POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED."
;
RUN;
OK. Add one more statement ( n=0; ) into it.
proc import datafile='/folders/myfolders/test1.xlsx' out=test dbms=xlsx replace;
run;
data temp;
set test;
n=0;
do i=1 to countw(report,' ');
temp=scan(report,i,' ');
if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
output;
end;
keep patient_id n temp;
run;
data want;
set temp;
by patient_id n;
length want $ 32767;
retain want;
want=catx(' ',want,temp);
if last.n then do;output;call missing(want);end;
drop temp;
run;
Are you importing this from a text file? Can you correct it in the import step?
Hi Reeza,
Sorry for the inconvenience. What you are seeing is just a single record having 8 lines.
Regards,
Deepak
I understand that. My question is how did you get the data in the first place. Sometimes its easier to manipulate and make changes in the original data import step rather than afterwards.
How do you want your output to look like?
Do all your questions have the exact same structure or do they vary, number of sections/delimiters.
Do any responses have character ")" or can that be used as a delimiter.
Take a look at the SCAN(), SUBSTR() functions for a start.
Hi reeza,
I want each section of my report in a separate column. For e.g. in this record , there are 8 lines, so I want each line in a separate column. The character ")" can that be used as a delimiter. The number of sections/delimiters may vary but I have different types of variation, I came across such as 1) 2) 3) 1. 2. 3. a. b. c. a) b) c).
sorry to provide you a messy record on Friday.
Regards,
Deepak
Your posted data contains linefeed or carriage returns between what I believe you want for your new variables. Does your actual data contain those characters? That would be the easiest thing to split on if so.
Hi Ballardw,
Thank you for your kind reply. Actually this is the real data what you are seeing. If you want to manipulate it to facilitate analysis, please go ahead. I will follow your path. My target is to get separate reports which are existing in a single report in merged form.
Regards,
Deepak
There is an issue that sometimes what is posted is not actually the content of the data after being filtered through the forum paste and copy to my side. I see a linebreak but that could be caused by more than one character and depending on the original file source could be a carriage return/line feed pair of characters, a single line feed, a single carriage return, and possibly even a vertical tab. But when pasting into the SAS editor it "fixes" things so that those characters are treated differently. So testing code is difficult.
I'm sure there is some slick way with regular expressions to do this but I don't work with them often.
This MIGHT help but this is untested because of the issue above.
DATA TEST;
set have; /* your dataset*/
/* assumes there are never more than 10 substatements increase the 10 if that is not the case. The substatments have to be long enough to the
the longest substatement, the 700 should be as long as the length of the REPORT variable*/
length tstr s1-s10 $ 700;
tstr = report; /* going to be slicing up things, don't want to lose orgininal data*/
array s $ s1-s10;/* the 10 matches the 10 in the Length*/
i = 1;
do while (tstr ne '');
pos = anycntrl(tstr); /*look for control characters*/
/* when none then there should not be a line break*/
if pos = 0 then do ;
s[i]=tstr;
tstr='';
end;
/* when a linebreak is found, keep the string up to that position
and remove that portion from the temporary string*/
else do;
s[i]= substr(tstr,1,pos-1);
tstr= substr(tstr,pos+1);
end;
i=i+1;
end;
drop pos i;
run;
Hi Ballardw,
Everytime I go through your reply, I learn something new. I get a new direction to think about my issue. Thank you for your incredible replies.
Using the code provided, I generated the output and attached for your kind consideration to get future direction as I am interested to address this issue using method other than regular expression.
Thank you in advance for your kind reply.
Regards,
Deepak
That is really not easy. There are too many patterns you need to consider about . DATA TEST; REPORT= "1) POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED. 2) POLYP AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 3. ADENOMA AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. 4. ADENOMA FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND. A) ADENOMATOUS TISSUE AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED. B) ADENOMATOUS TISSUE AT RIGHT COLON: INFLAMMATORY. NEGATIVE FOR TUBULOVILLOUS. A. POLYP FROM TRANSVERSE COLON: NEGATIVE FOR INFLAMMATORY BUT SERRATED IS FOUND. B. POLYP AT LEFT COLON: TUBULOVILLOUS. NEGATIVE FOR SERRATED." ; RUN; data temp; set test; do i=1 to countw(report,' '); temp=scan(report,i,' '); if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1; output; end; keep n temp; run; data want; set temp; by n; length want $ 32767; retain want; want=catx(' ',want,temp); if last.n then do;output;call missing(want);end; drop temp; run;
Hi Ksharp,
Kindly accept my apology for the delayed reply. The solution provided by you is simply great ! The different sections present in my report can be separated easily into multiple rows. Is it possible to shift different section to new coulmns from the begining. (inspite of using proc transpose after the code provided by you. I am just curious; otherwise the solution provided for this critical situation is awesome.
Regards,
Deepak
Hi Ksharp,
First of all, I want to thank you for providing me a great solution for a very critical issue related to my dataset having multiple reports merged together into a single report.
I am trying to use the code provided by you to my dataset having multiple patients. I am experiencing difficulty in splitting the reports per patient. I have attached a sample for your kind consideration. Can you kindly guide me further.
Thank you in adavance for your kind guidance.
Regards,
Deepak
OK. no problem. Just add "patient_id" into group variables.
proc import datafile='/folders/myfolders/test1.xlsx' out=test dbms=xlsx replace;
run;
data temp;
set test;
do i=1 to countw(report,' ');
temp=scan(report,i,' ');
if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
output;
end;
keep patient_id n temp;
run;
data want;
set temp;
by patient_id n;
length want $ 32767;
retain want;
want=catx(' ',want,temp);
if last.n then do;output;call missing(want);end;
drop temp;
run;
Good morning Ksharp,
Thank you for your incredible reply. For your kind information, I am trying to identify the number of sections existing in each report per person. Using the existing code, I am getting a number which reflects the overall count of sections. For your kind understanding, I have attached an excel sheet reflecting my needs. This is needed to get an idea about the number of reports belonging to each patient as well as different id can be allotted by concatenating patient id with the serial number.
Once again thank you in advance with an expectation to get another invaluable tips from you which will be a life time experience as usual.
Regards,
Deepak
OK. Add one more statement ( n=0; ) into it.
proc import datafile='/folders/myfolders/test1.xlsx' out=test dbms=xlsx replace;
run;
data temp;
set test;
n=0;
do i=1 to countw(report,' ');
temp=scan(report,i,' ');
if prxmatch('/^(\d+|[a-z])\W$/i',strip(temp)) then n+1;
output;
end;
keep patient_id n temp;
run;
data want;
set temp;
by patient_id n;
length want $ 32767;
retain want;
want=catx(' ',want,temp);
if last.n then do;output;call missing(want);end;
drop temp;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.