BookmarkSubscribeRSS Feed
Shmuel
Garnet | Level 18

Can you sort the proc freq output by variable name and its frequency ?

Assuming YES:

 

- Select first 5 rows (maybe for some variables there are less than 5) - per variable name

  and transpose it (may be done by retaining array value1-value5 freq1-freq5) 

  geting output one line paer variable.

 

- merge it with proc contents (or seql dictionary.columns) output

 

- create your report.

Uknown_user
Quartz | Level 8

I understand what you are saying but is it possible for you to post your code here that would return desired output for sample data I posted here? Assuming it can be in 3 single steps: sort the output of proc freq (select just 5 records), transpose it somehow and finally merge it with proc contents?

Reeza
Super User

What about ties?

 

data demonstrate;
set loop;
if mod(i,2) = 0 then _b = 'value1';
else _b = 'value2';
_x = floorz(abs(rand("NORMAL")*2))+1;
if _x > length(_b) then _x = length(_b)-1;
if _x = 0 then _x = 1;
_y = floorz(abs(rand("NORMAL")*2))+1;
if _y + _x > length(_b) then _y = 1;
_c = substr(_b,_y,_x);
d = cats(of _:);
run;

ods table onewayfreqs=temp;
proc freq data=demonstrate;
	table _all_;
run;

*Format output;
data want;
length variable $32. variable_value $50.;
set temp;
Variable=scan(table, 2);

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency percent cum:;
label variable='Variable' 
	variable_value='Variable Value';
run;

proc sort data=want;
by variable descending frequency;
run;

data want2;
set want;
by variable;
if first.variable then count=0;
count+1;
if count<=5;
run;
Uknown_user
Quartz | Level 8

Thanks for your suggestion, the data needs to be transposed and afterwards merged with proc contents. It is still not clear to me how to get the information about completeness (i.e. total missing obs and total non-missing obs). Thanks

Reeza
Super User

@Uknown_user wrote:

Thanks for your suggestion, the data needs to be transposed and afterwards merged with proc contents.


That should be within your skill sets. 

 

It is still not clear to me how to get the information about completeness (i.e. total missing obs and total non-missing obs). Thanks

 

You realize this is the first time you've mentioned this requirement?

 

There are macro's on here that illustrate how to calculate all missing/completeness report. 

 

 

Have you tried the CHARACTERIZE DATA task in EG? It doesn't do a lot of this already? (Tasks>Describe>Characterize)

 

PROC FREQ can do the missing summary, see this example here:

https://gist.github.com/statgeek/2de1faf1644dc8160fe721056202f111

 

 

The remaining information you get from PROC CONTENTS or SASHELP.VCOLUMN. I think you have all the components now, you need to figure out how to incorporate/combine them all into what you want as a final output.

 

Good Luck.

 

 

Uknown_user
Quartz | Level 8

Thanks for your suggestions. I never tried build-in macros, let us give a try (already started execution of this one). I have adopted the solution posted on github within my script - thanks for providing me with the link. Is there any way how to call these build-in macros on datasets created on a go (within a loop for example)?

Shmuel
Garnet | Level 18

It seems that the code @Reeza posted is simmlar to the logic I prposed, so next code is based on @Reeza's code

with some changes:

1) I have added to the 1st step, creating the demostarate dataset some missing values.

2) I have added to the last step creation of the arrays - most 5 frequncies and ther values, 

    amount of missing values and of total values per variable.

data demonstrate;
do i=1 to 1000;
if mod(i,2) = 0 then _b = 'value1';
else _b = 'value2';
_x = floorz(abs(rand("NORMAL")*2))+1;
if _x > length(_b) then _x = length(_b)-1;
if _x = 0 then _x = 1;

if i in(10,30,100,400) then _x = .;   /* added some missing values */
_m = .;                               /* added new variable missing at all obs. */
_y = floorz(abs(rand("NORMAL")*2))+1;
if _y + _x > length(_b) then _y = 1;
_c = substr(_b,_y,_x);
d = cats(of _:);
output;
end;
run;

ods table onewayfreqs=temp;
proc freq data=demonstrate(drop=i); /* drop unique */
	table _all_ / missing nopercent nocum;
run;

data want;
length variable $32. variable_value $50.;
set temp;
Variable=upcase(strip(scan(table, 2)));

Variable_Value=strip(trim(vvaluex(variable)));

keep variable variable_value frequency;
label variable='Variable' 
	variable_value='Variable Value';
run;

proc sort data=want;
by variable descending frequency;
run;

data want2;
set want;
by variable;
retain val1-val5 frq1-frq5 misfrq totfreq;
array valx $ val1-val5;
array frqx frq1-frq5;

if first.variable then do;
   count=0; misfrq=0; totfreq=0;;
   do i=1 to 5;
      call missing(valx(i));
      call missing(frqx(i));
   end;
end;

totfreq = sum(of totfreq, frequency);
if upcase(variable) = '_X' then put totfreq=;

if variable_value in(' ' '.')  
then misfrq=frequency;

count+1;
if count le 5 then do;;
   valx(count) = variable_value;
   frqx(count) = frequency;
end;
if last.variable then output;
drop i count variable_value frequency;
run;







Uknown_user
Quartz | Level 8

Thank you for your help. I have adopted solution with build-in macro "characterize" and after few modifications, it worked (ran sucessfully for all countries). Your code works on the sample data, I will try it on the actual data and measure run times. Thanks again!

Rick_SAS
SAS Super FREQ

When you use PROC FREQ on a table, the default behavior is to print the results. Since your input data has 22 million rows, is it possible the problem is that the variable has many unique values and that PROC FREQ is trying to create an HTML file that contains 105 tables, some of which have millions of rows?  Try using the NOPRINT option on PROC FREQ and use the OUT= option to create a data set of the results. Then you can use PROC CONTENTS to see how many rows are in the output data set.

 

Also, replace the _ALL_ keyword with a particular variable name that you know has only a few levels, such as SEX or GENDER.

Uknown_user
Quartz | Level 8

Thanks for your contribution to this post. Well yes, the default behaviour is to print a html file, however, if this exceeds limit, it simply pups up a window (when the script finishes execution) that asks if I wish to display the results or not, by clicking no, it displays nothing. Unfortunatelly, the html issue is not my case.

 

The error returned by SAS indicated that it stopped after hundreds of thousands records of given variable (originally not posted here) which provides one with clear information where the problem was. It cannot bear that many unique values for given variable, after exlusion of 4 variables (that were either unique or nearly unique) the script ran successfully. This is the example of returned error:

 

Capture.PNG

 

Do you think it will help if I break these subsets column-wise and loop through so it process just few variables at once?

Reeza
Super User

Different question:

Does it make sense to have the most frequent five values for variables with that many unique values? I suspect not...I suspect in those case a statistical summary is more effective, ie mean/median/range etc.

Uknown_user
Quartz | Level 8

I excluded these variables from the script thought.

Rick_SAS
SAS Super FREQ

The first part of your program applies formats to find the number of missing/nonmissing values.  You seem to have used the tips in the article "Count the number of missing values for each variable."  If you link to references that you used, that helps us understand what you are trying to accomplish.

 

When you use PROC FREQ to find the five most common values, you need to be a little careful. PROC FREQ is intended for categorical data analysis. The variables that you analyze should have discrete values, such as race, gender, and type of automobile.  For continuous variables, you can use PROC MEANS or PROC UNIVARIATE to reveal basic descriptive statistics. 

 

Uknown_user
Quartz | Level 8

I am trying to run it on consumer base table which consists of many countries. There are many different types of data ranging from telephone numbers to email addresses and gender or dates of birth etc. Each of these (except unique identifiers) is desired to be analyzed thought.

Reeza
Super User

@Uknown_user wrote:

I am trying to run it on consumer base table which consists of many countries. There are many different types of data ranging from telephone numbers to email addresses and gender or dates of birth etc. Each of these (except unique identifiers) is desired to be analyzed thought.


I can guarantee you that the 5 most frequent email addresses and date of birth are useless. 

Break up your variables by type, categorize them as categorical/numerical and use the appropriate summaries for each. 

You can still automate the analysis by the different types.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 35 replies
  • 2185 views
  • 10 likes
  • 7 in conversation