Hi everyone,
I have attached a file of my data. I need some help on how to get frequencies of phrases that are separated by double spaces.
From the attached spreadsheet with diagnoses, the observations can consist of multiple phrases separated by double spaces.
Face: triangular Palpebral fissures: short Ears: low set both Mouth: microstomia Mouth: palate high-arched
I would like to find out how many people in the dataset have “Ears: low set both .” By default, SAS will return a frequency of “Face: triangular Palpebral fissures: short Ears: low set both Mouth: microstomia Mouth: palate high-arched” rather than scan through and see how many people have just “Ears: low set both.”
Thanks!
I don't see a new file, but in general trying to read a file using two different delimiters is a pain.
You should be able to use read it using both comma and space as a delimiter, but the embedded commas will cause trouble with the & modifier as it will see two delimiters. Also have to add something to deal with quotes.
data want ;
* infile 'diagnoses.csv' truncover firstobs=2;
infile cards dlm=' ,' truncover ;
line+1;
length id $10 gender $1 line 8 loc 8 word next $200.;
input id gender word @;
word = compress(word,'"');
do until(next=' ');
input next & @;
next = compress(next,'"');
if not index(next,':') then word=catx(', ',word,next);
else do;
if word ne ' ' then do; loc=sum(loc,1); output; end;
word=next;
end;
end;
if word ne ' ' then do; loc=sum(loc,1); output; end;
drop next;
cards;
12345,M,Back: lordosis Face: micrognathia
9876,F,"Abdomen: asplenism, Intestinal malrotation Face: micrognathia"
run;
proc print;
run;
You probably would get better results to read the words as one long string and use a DO loop to scan the string for double spaces. The CALL SCAN() function should help with this.
You may need to provide an example of what you want the output to look like. I can come up with several interpretations of "frequencies ... separated by double spaces".
I think you want to read the file as a single line of text and ignore the delimiter, use the _infile_= <some variable name> to create a variable with the line of text. Then use one of the string search functions such as index to look for the specific phrase. You may want to use upcase as well incase the capitalization isn't consistent.
If your data is in a file then use the & modifier on the INPUT statement.
data want ;
length line 8 loc 8 word $200.;
infile 'diagnoses.csv' truncover firstobs=2;
line+1;
do loc=1 by 1 until(word=' ');
input word & @;
if word ne ' ' then output;
end;
run;
proc freq;
tables word;
run;
Thank you so much Tom! This is amazing.
This is part of the frequency output.
"Back: lordosis | 1 | 0.50 |
---|---|---|
"Face: micrognathia | 1 | 1.00 |
Abdomen: ascites | 1 | 1.49 |
Abdomen: asplenism, Intestinal malrotation" | 1 | 1.99 |
Do you know why there are the quotes in the beginning of those phrases, and quotes at the end of the 4th phrase? There are no such quotes in the dataset. There are more occurrences of these as well later on.
remove those quotes by adding just a little to Tom's code
data want ;
length line 8 loc 8 word $200.;
infile 'diagnoses.csv' truncover firstobs=2;
line+1;
* remove quotes ;
INPUT @ ;
_INFILE_ = TRANSLATE(_INFILE_,' ', '"' ) ;
do loc=1 by 1 until(word=' ');
input word & @;
if word ne ' ' then output;
end;
run;
proc freq;
tables word;
run;
Quotes are in the data, but when you read it before as a CSV file (or in SAS with the DSD option on the INFILE statement) the quotes were automatically stripped.
Look at the DEQUOTE() function and apply it to the whole input line.
Add these two lines before the DO loop.
input @ ;
_infile_= dequote(_infile_);
Thanks!
I attached a new csv file that includes some member and ID information for each diagnoses. I am really unsure how to add them in using the input statements, as this command is not entirely familiar with me.
I don't see a new file, but in general trying to read a file using two different delimiters is a pain.
You should be able to use read it using both comma and space as a delimiter, but the embedded commas will cause trouble with the & modifier as it will see two delimiters. Also have to add something to deal with quotes.
data want ;
* infile 'diagnoses.csv' truncover firstobs=2;
infile cards dlm=' ,' truncover ;
line+1;
length id $10 gender $1 line 8 loc 8 word next $200.;
input id gender word @;
word = compress(word,'"');
do until(next=' ');
input next & @;
next = compress(next,'"');
if not index(next,':') then word=catx(', ',word,next);
else do;
if word ne ' ' then do; loc=sum(loc,1); output; end;
word=next;
end;
end;
if word ne ' ' then do; loc=sum(loc,1); output; end;
drop next;
cards;
12345,M,Back: lordosis Face: micrognathia
9876,F,"Abdomen: asplenism, Intestinal malrotation Face: micrognathia"
run;
proc print;
run;
You probably would get better results to read the words as one long string and use a DO loop to scan the string for double spaces. The CALL SCAN() function should help with this.
I actually just replaced the file with the same name "diagnoses.csv"
Another one is DLMSTR= double blanks .
data temp; infile 'c:\temp\diagnoses.csv' dlmstr=' ' ; input x : $100. @@; run;
Xia Keshan
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.