Solved: Summarize a large table by calculating counts, means and standard devi...

St_Rokas · Posted 05-14-2023 12:18 AM

Hello,

I am trying to summarize and average data in a large table with up to 2.5 million rows and 120 columns. The structure is like this: a single CASE is divided into thousands of REGIONS, each of which can have a RANK (usually ranging from -30 to 30 including zero) and more than a hundred measured different FEATURES in that region:

I need to convert this data into two tables. One of them lists down unique CASES with all possible RANK values as colmns and has the count of REGIONS with such a rank in a cell, like this:

The second table should be a summary list: mean and standard deviation of each FEATURE per RANK present in a CASE, like this:

Of course the two tables can be combined by simply adding a column RANK_COUNT in the second table.

I would appreciate your help,

Thank you,

Rokas

yabwon · Posted 05-14-2023 10:41 AM

I think the most "standard way" would be to use the following approach:

data have;
  call streaminit(42);
  do Case="A","C","E","G","I","K","M","O","Q","S","U","W","Y";
    do Region = 1 to rand("integer",200, 300);
      Rank = rand("integer",-5, 5);
      array f feature_01-feature_20;
      do over f;
        f=rand("normal");
      end;
      output;
    end;
  end;
run;


proc sql;
  create table r as
    select C as Case, Rank, count(region) as rank_count
    from have(rename=(case=c))
    group by C, Rank
    order by C, Rank
  ;
quit;


proc summary data=have NWAY;
  class Case Rank;
  var feature_:;
  output  out=m_sd(drop=_:) MEAN(feature_:)= STD(feature_:)= / AUTONAME; 
run;

proc transpose data=m_sd(obs=0) out=rename(keep=_name_);
  var feature_:; 
run;

filename f TEMP;
data _null_;
file f;

put 'data Combined; merge r m_sd; by Case Rank; rename';

do until(EOF);
  set rename end=EOF;
  _name2_ = catx("_",scan(_NAME_,3,"_"),scan(_NAME_,1,"_"),scan(_NAME_,2,"_"));
  put _NAME_ "=" _NAME2_;
end;

put ';run;';
stop; 
run;
%include f /source2;

[EDIT:] ~~the only thing is that variables with mean and std will be called: feature_XX_mean and feature_XX_stdDev. If it is a big problem let me know.~~

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

View solution in original post

yabwon · Posted 05-14-2023 10:13 AM

Hi and welcome to SAS-communities!

If you want to get help faster provide some example data in usable form, not as pictures... the best as a data step with cards e.g.,

data have;
input x a b $ c d $;
cards;
1 11 A 100 ABC
2 22 B 200 DEF
3 33 C 300 GHI
4 44 D 400 JKL
;
run;

Read about it here: https://communities.sas.com/t5/help/faqpage/faq-category-id/posting/title/use_syntax_highlighter

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

St_Rokas · Posted 05-14-2023 12:53 PM

Thank you for the tip, Bart - next time I will most certainly do that.

yabwon · Posted 05-14-2023 10:41 AM

I think the most "standard way" would be to use the following approach:

data have;
  call streaminit(42);
  do Case="A","C","E","G","I","K","M","O","Q","S","U","W","Y";
    do Region = 1 to rand("integer",200, 300);
      Rank = rand("integer",-5, 5);
      array f feature_01-feature_20;
      do over f;
        f=rand("normal");
      end;
      output;
    end;
  end;
run;


proc sql;
  create table r as
    select C as Case, Rank, count(region) as rank_count
    from have(rename=(case=c))
    group by C, Rank
    order by C, Rank
  ;
quit;


proc summary data=have NWAY;
  class Case Rank;
  var feature_:;
  output  out=m_sd(drop=_:) MEAN(feature_:)= STD(feature_:)= / AUTONAME; 
run;

proc transpose data=m_sd(obs=0) out=rename(keep=_name_);
  var feature_:; 
run;

filename f TEMP;
data _null_;
file f;

put 'data Combined; merge r m_sd; by Case Rank; rename';

do until(EOF);
  set rename end=EOF;
  _name2_ = catx("_",scan(_NAME_,3,"_"),scan(_NAME_,1,"_"),scan(_NAME_,2,"_"));
  put _NAME_ "=" _NAME2_;
end;

put ';run;';
stop; 
run;
%include f /source2;

[EDIT:] ~~the only thing is that variables with mean and std will be called: feature_XX_mean and feature_XX_stdDev. If it is a big problem let me know.~~

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

St_Rokas · Posted 05-14-2023 12:54 PM

Thank you!

PaigeMiller · Posted 05-14-2023 12:34 PM

Hello @St_Rokas

If, in the last table, you want to "combine" the first and second table so that Rank_Count is vertical, don't create a table (as in your first table) such that the counts are horizontal. You are simply making your work more difficult. Create counts so that they are vertical.

PROC SUMMARY will create the final table, without first having to create the first two tables. As you have not provided data in a usable form (a "usable form" means working SAS data step code as shown by @yabwon, and not any other method) the following code is UNTESTED.

proc summary data=have nway;
    class case region rank;
    var feature_01 feature_02 feature_03;
    output out=want n(feature_01)=rank_count mean=f1_mean f2_mean f3_mean
        stddev=f1_std f2_std f3_std;
run;

--
Paige Miller

St_Rokas · Posted 05-14-2023 12:56 PM

Thank you for taking time to help!

Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

Re: Summarize a large table by calculating counts, means and standard deviations

SAS Innovate 2025: Call for Content

Classroom Training Available!