Solved: Re: Count the occurance of several words in a column

Yves_Boonen · Posted 04-07-2016 04:51 AM

I have an Excel sheet with a few dozen columns. One of these columns is called "distribution options". In this column you find general keywords to indicate how it will be distributed. Below you will find all the available "distribution options".

Fast Track Heir

Fast Track Living Expenses

Fast Track Balance

Fast Track Surety

Manual Distribution Office

Distribution Notery

Distribution Standard

I wanted to read out the entire column and receive a table how many times each of these distribution options appear in the column.

The result in the end should look similar to this.

Distribution Options	Count
Fast Track Heir	102
Fast Track Living Expenses	230
Fast Track Balance	17
Fast Track Surety	43
Manual Distribution Office	2
Distribution Notery	18
Distribution Standard	43

I wrote a script, but for some reason it doesn't count the Fast Track Heir, Fast Track Surety and Manual Distribution Office. All the other values (the same as above) are correct.

I used this script, but perhaps there is an easier approach. I simply can't figure out why Heir, Surety and Manual return "0" values, whereas the rest returns the values shown above. There are no errors in the log either. This time I checked the data myself and used COUNTIF(range;"word") to count it in Excel. The values above are correct. The script somehow isn't.

PROC SORT IN=atropos.dossiers OUT=atropos.dos_opt;
BY dossier_manager;
RUN;

DATA atropos.dos_rap;
SET atropos.dos_opt;
BY dossier_manager;
RETAIN
	Heir
	Living_Expenses
	Balance
	Surety
	Manual
  	Notary
	Standard;

IF first.dossier_manager
THEN DO;
	Heir=0;
	Living_Expenses=0;
	Balance=0;
	Surety=0;
	Manual=0;
  	Notary=0;
	Standard=0;
END;

SELECT (distributionoption);

  	WHEN ('Fast Track 1 Heir') Heir +1;

  	WHEN ('Fast Track Living Expenses') Living_Expenses +1;

	WHEN ('Fast Track Balance') Balance +1;
  	
  	WHEN ('Fast Track Surety") Surety +1;

	WHEN ('Manual Distributionassignment Office') Manual +1;

	WHEN ('Distribution Notery') Notery +1;

	WHEN ('Distribution Standard') Standard +1;

  	OTHERWISE;
END;

IF last.dossier_manager THEN OUTPUT;
KEEP dossier_manager Heir Living_Expenses Balance Surety Manual Notary Standard;
RUN;

Loko · Posted 04-07-2016 04:59 AM

Hello,

Are you sure the values are written the way you have added in the select clause ?

By the way select clause is useful when:

- you have a long series of mutually exclusive numeric conditions
- data values are uniformly distributed

In your case I believe an aprroach using if then else condition would be more appropriate.

View solution in original post

Loko · Posted 04-07-2016 04:59 AM

Hello,

Are you sure the values are written the way you have added in the select clause ?

By the way select clause is useful when:

- you have a long series of mutually exclusive numeric conditions
- data values are uniformly distributed

In your case I believe an aprroach using if then else condition would be more appropriate.

Yves_Boonen · Posted 04-07-2016 05:13 AM

Yes, the values are written correctly. I have copy pasted the values from the Excel sheet. I even tried to change

WHEN ('Fast Track 1 Heir') Heir +1;

to

WHEN ('%Heir%') Heir +1;

Even then it didn't work.

So you're suggestion would look like this?

DATA atropos.dos_rap;
SET atropos.dos_opt;
BY dossier_manager;
RETAIN
	Heir
	Living_Expenses
	Balance
	Surety
	Manual
  	Notary
	Standard;

IF first.dossier_manager
THEN DO;
	Heir=0;
	Living_Expenses=0;
	Balance=0;
	Surety=0;
	Manual=0;
  	Notary=0;
	Standard=0;
END;

SELECT (distributionoption);

  	IF distributionoption = ('Fast Track 1 Heir') THEN Heir +1;

  	ELSE IF distributionoption = ('Fast Track Living Expenses') THEN Living_Expenses +1;

	ELSE IF distributionoption = ('Fast Track Balance') THEN Balance +1;
  	
  	ELSE IF distributionoption = ('Fast Track Surety") THEN Surety +1;

	ELSE IF distributionoption = 'Manual Distributionassignment Office') THEN Manual +1;

	ELSE IF distributionoption = ('Distribution Notery') THEN Notery +1;

	ELSE IF distributionoption = ('Distribution Standard') THEN Standard +1;
END;

IF last.dossier_manager THEN OUTPUT;
KEEP dossier_manager Heir Living_Expenses Balance Surety Manual Notary Standard;
RUN;

Yves_Boonen · Posted 04-07-2016 05:54 AM

I tried a different approach.

I imported the file. Then I took the imported data and only selected the distribution options, while storing this new selection as a new table. I noticed that some of the distribution options were incomplete and thus the query didn't work.

For example: "Fast Track 1 Heir" became "Fast Track 1 He" as well as with the other two.

I did copy and paste the exact values from the Excel file, but that wasn't the table my query tried to work with. In fact it was an import in my SAS library. Instead of checking the raw data, I should also check my imported data more. Consider this a lesson learned, because after I adjusted the names in the first script, everything ran smoothly!

Thanks for helping me get there.

RW9 · Posted 04-07-2016 05:22 AM

Hi,

Firstly, please code in consistent casing/indentaiton, makes it much easier to read, and posting example test data in the form of a datastep will help you get tested code.

Now your problem. I thinkk you are looking at it the wrong way round, you have list of what values you want out, and then trying to count from that. The other way of looking at it is to proc means the data you have, and then filtering the results for those you want to keep. This has the benefit of being able to see what counted data is present. For example (untested);

proc means data=have;
  by distribution_options;
  output out=results;
run;
/* Filter out the ones you want to keep */
proc sql;
  create table WANT as
  select  *
  from    DISTRIBUTION_OPTIONS in (select OPTIONS from LIST_OF_VALUES);
quit;

Nice and simple way, though to be honest I don't think you need the second step, just drop records you don't want in your further processing, then its just one step.

Just to add, on your second post, a select clause can look like:

select(<test>);

when(<result>) <code>;

Or ir can look like:

select;

when (index(strip(upcase(<text>)),"HIKE")>0) <code>;

Kurt_Bremser · Posted 04-07-2016 05:54 AM

If this is really EXACTLY your code, look at the log, because you have non-matching quotes in

WHEN ('Fast Track Surety") Surety +1;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Yves_Boonen · Posted 04-07-2016 05:57 AM

No it isn't.

The original code (in my SAS client) has variable names in Dutch. To make it more understandable, I manually changed all the names in here. Kinda screwed me over, but rest assured: no syntax errors. 🙂

Thanks!

LinusH · Posted 04-07-2016 08:07 AM

Not sure I under the purpose of this. Seems to me as a hard coded recode of column values, count, and then a transpose...?

Similar to @RW9, I think there are smarter ways of accomplish this.

A format/lookup table can recode/rename your distribution options.

PROC TABULATE does cross tabulation out of the box - no need to transpose your data set.

Data never sleeps

stat_sas · Posted 04-07-2016 09:05 AM

Another way to do this is using proc sql:

data have;
input distribution_options $40.;
datalines;
Fast Track Heir
Fast Track Living Expenses
Fast Track Balance
Fast Track Surety
Manual Distribution Office
Distribution Notery
Distribution Standard
Fast Track Heir
Fast Track Living Expenses
Fast Track Balance
;

proc sql;
create table want as
select trim(distribution_options) as distribution_options , count(*) as Count
from have
group by trim(distribution_options);
quit;

Haikuo · Posted 04-07-2016 09:51 AM

Maybe I am missing something, but Proc Freq seems to be designed for this kind of job.

Classroom Training Available!