Solved: Re: Getting frequencies of phrases separated by double spaces

SASbeginner20 · Posted 08-11-2014 02:52 PM

Hi everyone,

I have attached a file of my data. I need some help on how to get frequencies of phrases that are separated by double spaces.

From the attached spreadsheet with diagnoses, the observations can consist of multiple phrases separated by double spaces.

Face: triangular Palpebral fissures: short Ears: low set both Mouth: microstomia Mouth: palate high-arched

I would like to find out how many people in the dataset have “Ears: low set both .” By default, SAS will return a frequency of “Face: triangular Palpebral fissures: short Ears: low set both Mouth: microstomia Mouth: palate high-arched” rather than scan through and see how many people have just “Ears: low set both.”

Thanks!

Tom · Posted 08-12-2014 05:04 PM

I don't see a new file, but in general trying to read a file using two different delimiters is a pain.

You should be able to use read it using both comma and space as a delimiter, but the embedded commas will cause trouble with the & modifier as it will see two delimiters. Also have to add something to deal with quotes.

data want ;

* infile 'diagnoses.csv' truncover firstobs=2;

infile cards dlm=' ,' truncover ;

line+1;

length id $10 gender $1 line 8 loc 8 word next $200.;

input id gender word @;

word = compress(word,'"');

do until(next=' ');

input next & @;

next = compress(next,'"');

if not index(next,':') then word=catx(', ',word,next);

else do;

if word ne ' ' then do; loc=sum(loc,1); output; end;

word=next;

end;

if word ne ' ' then do; loc=sum(loc,1); output; end;

drop next;

cards;

12345,M,Back: lordosis Face: micrognathia

9876,F,"Abdomen: asplenism, Intestinal malrotation Face: micrognathia"

run;

proc print;

run;

You probably would get better results to read the words as one long string and use a DO loop to scan the string for double spaces. The CALL SCAN() function should help with this.

View solution in original post

ballardw · Posted 08-11-2014 03:34 PM

You may need to provide an example of what you want the output to look like. I can come up with several interpretations of "frequencies ... separated by double spaces".

I think you want to read the file as a single line of text and ignore the delimiter, use the _infile_= <some variable name> to create a variable with the line of text. Then use one of the string search functions such as index to look for the specific phrase. You may want to use upcase as well incase the capitalization isn't consistent.

Tom · Posted 08-11-2014 03:47 PM

If your data is in a file then use the & modifier on the INPUT statement.

data want ;

length line 8 loc 8 word $200.;

infile 'diagnoses.csv' truncover firstobs=2;

line+1;

do loc=1 by 1 until(word=' ');

input word & @;

if word ne ' ' then output;

end;

run;

proc freq;

tables word;

run;

SASbeginner20 · Posted 08-11-2014 04:51 PM

Thank you so much Tom! This is amazing.

This is part of the frequency output.

"Back: lordosis	1	0.50
"Face: micrognathia	1	1.00
Abdomen: ascites	1	1.49
Abdomen: asplenism, Intestinal malrotation"	1	1.99

Do you know why there are the quotes in the beginning of those phrases, and quotes at the end of the 4th phrase? There are no such quotes in the dataset. There are more occurrences of these as well later on.

Peter_C · Posted 08-11-2014 05:24 PM

remove those quotes by adding just a little to Tom's code

data want ;

length line 8 loc 8 word $200.;

infile 'diagnoses.csv' truncover firstobs=2;

line+1;

* remove quotes ;

INPUT @ ;

_INFILE_ = TRANSLATE(_INFILE_,' ', '"' ) ;

do loc=1 by 1 until(word=' ');

input word & @;

if word ne ' ' then output;

end;

run;

proc freq;

tables word;

run;

Tom · Posted 08-11-2014 05:36 PM

Quotes are in the data, but when you read it before as a CSV file (or in SAS with the DSD option on the INFILE statement) the quotes were automatically stripped.

Look at the DEQUOTE() function and apply it to the whole input line.

Add these two lines before the DO loop.

input @ ;

_infile_= dequote(_infile_);

SASbeginner20 · Posted 08-12-2014 04:35 PM

Thanks!

I attached a new csv file that includes some member and ID information for each diagnoses. I am really unsure how to add them in using the input statements, as this command is not entirely familiar with me.

Tom · Posted 08-12-2014 05:04 PM

I don't see a new file, but in general trying to read a file using two different delimiters is a pain.

You should be able to use read it using both comma and space as a delimiter, but the embedded commas will cause trouble with the & modifier as it will see two delimiters. Also have to add something to deal with quotes.

data want ;

* infile 'diagnoses.csv' truncover firstobs=2;

infile cards dlm=' ,' truncover ;

line+1;

length id $10 gender $1 line 8 loc 8 word next $200.;

input id gender word @;

word = compress(word,'"');

do until(next=' ');

input next & @;

next = compress(next,'"');

if not index(next,':') then word=catx(', ',word,next);

else do;

if word ne ' ' then do; loc=sum(loc,1); output; end;

word=next;

end;

if word ne ' ' then do; loc=sum(loc,1); output; end;

drop next;

cards;

12345,M,Back: lordosis Face: micrognathia

9876,F,"Abdomen: asplenism, Intestinal malrotation Face: micrognathia"

run;

proc print;

run;

You probably would get better results to read the words as one long string and use a DO loop to scan the string for double spaces. The CALL SCAN() function should help with this.

SASbeginner20 · Posted 08-12-2014 05:10 PM

I actually just replaced the file with the same name "diagnoses.csv"

Ksharp · Posted 08-12-2014 10:40 AM

Another one is DLMSTR= double blanks .

data temp;
 infile 'c:\temp\diagnoses.csv' dlmstr='  ' ;
 input x : $100. @@;
run;

Xia Keshan

Registration is open

SAS Training: Just a Click Away