BookmarkSubscribeRSS Feed
CathyVI
Pyrite | Level 9

Hello,

 

Am working on a multiple year dataset (2006-2022). Each year has over 60 million records so I cant merge/join these files.

I want to find freq/count(%) of some variables in each year of data.

I am using proc sql to count the variable one by one which is less efficient.

 

proc sql;
    select count(rectype) as total_count
    from rx_2006 ;
quit;

1. How can i generate the freq and percentage for multiple variables (e.g. rectype, precsb, inced )  separately for each years without joining the files?

2. How can I find the overall freq and percentage of each variable across all the years combined (2006-2020) without joining the files.

Thanks

4 REPLIES 4
Kurt_Bremser
Super User

If the datasets have the same structure, you don't want to join or merge them, you want to stack them, either with a SQL UNION or a DATA step SET.

You can define a DATA step view containing all the datasets, and the run PROC FREQ on the view.

ballardw
Super User

@CathyVI wrote:

Hello,

 

Am working on a multiple year dataset (2006-2022). Each year has over 60 million records so I cant merge/join these files.

Why NOT? do you a have physical limit to the work space you have?

 

If you don't want to combine them then a macro to call each set with Proc FREQ. You don't say whether you want a data set or a report or even the set names.

IF your data sets have names that end in the year then something like:

%macro simplefreq (startyear=2006 , endyear=2022 );

%do i= &startYear. %to &endyear. ;
   Title "Summary Counts for Year &i.";
   proc freq data=yourset&i.;
   run;
%end;
%mend;

%simplefreq;

will generate what you request. If the years are two digits and not, then be glad you aren't crossing the century boundary but you have to deal with constructing a different year variable because the loop counters won't preserve a leading 0.

 

Warning: This is a LOT of output and depending on the number of variables/values per variable is quite likely to exceed the result window capacity. I would

1) Close the default ODS destination (html most likely but some folks apparently still use the Listing destination

2) modify the code in the loop to write each year of data to a separate RTF or PDF document:

    place something like this around the title and Proc Freq code.

    ods rtf file="<my path>\report_&i..rtf";

    ods rtf close;

   Do pay attention to the two . in the file name. The first . is used by the Macro processor to append non-macro values to the value of the &i variable and the second is the actual dot used in the file name by the operating system.

 

 

CathyVI
Pyrite | Level 9

@ballardwThank you. Your response answered my first question. To answer your question about limited space... YES, I have limited space for such enormous data.


Any suggestion for question 2?

2. How can I find the overall freq and percentage of each variable across all the years combined (2006-2020) without joining the files.

ballardw
Super User

@CathyVI wrote:

@ballardwThank you. Your response answered my first question. To answer your question about limited space... YES, I have limited space for such enormous data.


Any suggestion for question 2?

2. How can I find the overall freq and percentage of each variable across all the years combined (2006-2020) without joining the files.


2: Create data sets with the summary counts.

   combine those data sets

   use the combined data with proc freq using the Existing Freq variable as a FREQ variable in proc freq.

However the "easy" data set to create from Proc Freq for multiple variables is using the ODS OUTPUT ONEWAYFREQ table that has a different structure.

Make the sets with something like. The below code is using the option OBS= 10 to only use 10 observations from each set so that you can 1) test the code in relatively short time and 2) look at the output data sets created.

%macro simplefreq (startyear=2006 , endyear=2022 );

%do i= &startYear. %to &endyear. ;
   Title "Summary Counts for Year &i.";
   ods output onewayfreqs= freqs&i. ;
   proc freq data=yourset&i (obs=10).;
   run;
%end;
%mend;

%simplefreq;

Really, look at the output sets and then ask about how to the next step.

Caveat: if you have any formats that group values this output uses the FORMATTED values to count. If you do not want that behavior you need to clear the formats for the duration of the Proc freq. But that may mean you don't recognize values, especially with dates, times and datetimes.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 569 views
  • 2 likes
  • 3 in conversation