In SAS 9.3
I'm creating a table in Proc Report and would like to display the values for a variable named "country" with all values (country names and their corresponding counts) displayed unless the country represents <20% of the total count. My dataset is huge, so I've created a sample of the data for illustration.
SAMPLE DATA
PersonID country count
1 Australia 1
2 USA 1
3 Australia 1
4 Canada 1
5 Canada 1
6 Canada 1
5 Germany 1
I created new variable "countrycount" to sum totals for each country.
proc sql /*noprint */;
create table finalsql as
select * , sum(count) as countrycount label='Country Count'
from final
group by country
quit;
to give me output:
PersonID country count countrycount
1 Australia 1 2
2 USA 1 1
3 Australia 1 2
4 Canada 1 3
5 Canada 1 3
6 Canada 1 3
5 Germany 1 1
I would like to show a table where Canada and Australia are displayed, and USA and Germany are collapsed into the 'other' category. I'd like to display the total counts by country, like this:
TABLE
Country Total
Canada 3
Australia 2
Other 2
I am trying to automate report generation, and the arrival data will be changing monthly, so I need a dynamic process to collapse countries into the "other" category. Is there a way to create a format that would essentially work like this? My challenge is to create a format based on the values from two variables, one character and one numeric. OR is there another way to do this?
/*I realize this wouldn't actually work in proc format - hope it helps explains my question*/
proc format
value $country_collapsed
country = country /* do this for all values where countrycount >=20% of total */
country totals <20% = 'other'; /* collapse all countries into 'other' category if countrycount <20% total */
run;
I think this is an example of what you might want. Basically get a summary of the data, created a cntlin dataset for proc format filtered on your value range, feed that to proc format and use. This overwrites the existing, if any format, and recreates it.
proc freq data=sashelp.cars noprint ;
tables cylinders / out=percent;
run;
data cylindercntlin;
set percent;
fmtname = 'Cylinders';
start = cylinders; end=cylinders;
label = 'Other';
Type='N'; /* for character this would be C*/
IF .< percent < 20; /* so only values with the percent less than 20 will be mapped*/
run;
proc format cntlin=cylindercntlin library=work;
run;
proc freq data=sashelp.cars ;
tables cylinders ;
format cylinders Cylinders.;
run;
If proc format is not the only option you have than you can also try creating a view and use the view in your program
example:
proc sql;
create view collapsed as
select personid,country,count,
case country
when 'Canada' then 'Canada'
when 'Australia' then 'Australia'
else 'other' end
from sample;
Thanks, Stevyfargose.
I will try that - I am not tied to proc format, but I do need to figure out how to dynamically create a collapsed "other" category (I don't want to hard code anything because the country names will change from month to month, and so will the counts associated with them).
Thanks, Cynthia. I'm going to try that and the CNTLIN options that you guys have suggested. I appreciate your thoughts on how to do this!
I think this is an example of what you might want. Basically get a summary of the data, created a cntlin dataset for proc format filtered on your value range, feed that to proc format and use. This overwrites the existing, if any format, and recreates it.
proc freq data=sashelp.cars noprint ;
tables cylinders / out=percent;
run;
data cylindercntlin;
set percent;
fmtname = 'Cylinders';
start = cylinders; end=cylinders;
label = 'Other';
Type='N'; /* for character this would be C*/
IF .< percent < 20; /* so only values with the percent less than 20 will be mapped*/
run;
proc format cntlin=cylindercntlin library=work;
run;
proc freq data=sashelp.cars ;
tables cylinders ;
format cylinders Cylinders.;
run;
Thanks, ballardw. I'm going to try this, as well. If I can figure it out, I'll post again with my code to make it work.
I wanted to follow up. I took the cntlin suggested by ballardw and with a little modification, it worked!
The only thing I am still working through is how to force the "Other" and missing categories to sort to the last position while allowing everything else to sort alphabetically (and again, I can't hard code the formatting since the countries will change, more will be added, etc.). I may end up having to manually do that in the end report - I can't figure out an order that will do the trick.
Thank you all for your help and suggestions. Here is my final code:
proc freq data=final noprint;
tables country_of_birth / out=r.percent ;
run;
data countryofbirthcntlin;
set r.percent;
fmtname= 'COB';
start=country_of_birth; end=country_of_birth;
Type='C'; /*this would be N for numeric*/
IF 0< percent < 1 then label='Other* ' ; /*so only values with percent less than 20 will be mapped to the other category*/
ELSE IF percent >=1 then label= country_of_birth;
run;
proc format cntlin=countryofbirthcntlin library=work;
run;
proc report data=countryofbirthcntlin;
column country_of_birth count;
define country_of_birth / group format=$COB.;
define count / 'Number';
rbreak after / summarize page ;
compute after / style={just=l};
endcomp;
run;
If you look at the Proc Format documentation you will see that there are a number of options that we have not addressed in this topic. One thing is to code a value format with different options and see if that helps. Then use the CNTLOUT option on proc format to see what the values set in the format used were so you can incorporate those elements into your format.
Also note that several procs have ORDER = directives that may interact with formats. For instance the DEFINE block supporst ORDER = with options of Data, Formatted, Freq and Internal. Without a specific instruction the default is Formatted, which will generally be the alphabetic order of your country format. So you would Lithuania, Other, Poland.
For Proc report you could either insure that the order of the data read into the procedure is the order you want and use the Order=Data option or go another step to create a variable that would be in the sequence you want, probably a numeric, and assign that sequence variable an equivalent and use the Order=internal option.
The steps above to get the format could also get you an informat, Invalue in proc format, to transform the name to a sequence variable.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.