SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 7
Accepted Solution

How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Hello,

Let's say that I have many individuals clustered (being the cluster a house, clinic, plot, etc). Each individual has an identification number, all individuals have characteristics such as weight, height, sex, income, etc, and they all have been measured multiple times. I want to calculate the sum of weights, sum of heights,  sum of income, etc, for ONLY all those individuals that are equal, greater, or smaller THAN an specific subject in the cluster or group; the subject individual is the one in the indicated row. How can I do that in the most accurate and efficient way?

The original table with data looks like this one:

clusteryearmeasurementindividual #heightweightincomestatus
12000111.6070L
21.7070L
31.8070L

The resulting table should look more or less like the one below.

clusteryearmeasurementindividual #heightweightincomestatus

heights of individuals

larger than the subject individual

weight of individuals heavier

than the subject individual

income of individuals

wealthier than the subject

12000111.6070L

I really appreciate any help that I can get; This might seem very simple for many, but I have not been able to solve this. Thanks for reading.

VR


Accepted Solutions
Solution
‎01-27-2014 02:37 PM
Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.

/* create some fake data */

data have ;

      do cluster = 1 to 10 ;

            do year = 2000 to 2010 ;

                  do measurement = 1 to 3 ;

                        do individual = 1 to 30 ;

                              height = ranuni(1234)*10;

                              weight = ranuni(1234)*100;

                              income = ranuni(1234)*100000;

                              output;

                        end;

                  end;

            end;

      end;

run;

/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest

      is greater than the individual's */

proc sql;

      create table want as

      select a.*, ( select sum ( b.weight )

                          from have b

                          where a.cluster=b.cluster

                              and a.year=b.year

                              and a.measurement=b.measurement

                              and a.weight < b.weight ) as weightabovesum

                        , ( select sum ( c.height )

                          from have c

                          where a.cluster=c.cluster

                              and a.year=c.year

                              and a.measurement=c.measurement

                              and a.height < c.height ) as heightabovesum

                        , ( select sum ( d.income )

                          from have d

                          where a.cluster=d.cluster

                              and a.year=d.year

                              and a.measurement=d.measurement

                              and a.income < d.income ) as incomeabovesum

      from have a

;

quit;

View solution in original post


All Replies
Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

Question 1: Are the variables (columns) CLUSTER, YEAR and MEASUREMENT populated for every observation (row) in your data? The sample you have provided suggests not??

Question 2: Are you adding the relevant vars (HEIGHT, WEIGHT, INCOME) only for unique individuals at each crossing of CLUSTER, YEAR and MEASUREMENT? For example, in the sample output you've provided, do we only want individuals from cluster=1, year=2000 and measurement=1 that are greater than the specified individual?

Question 3: Are you wanting to do a many-to-many analysis (e.g. all unique crossings compared to all other matched crossings), or a many-to-one analysis (e.g. all matched crossings compared to one selected unique observation)?

Occasional Contributor
Posts: 7

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Hi Fugue,

1. All variables (cluster, year, measurement) are populated for every individual (row) in the data set, I just did not do it in the example.

2. Yes, I only want individuals from cluster=1, year=2000, and measurement=1 that are greater than the specified individual.

3. I would like to do both, a many-to-many and a many-to-one; however, I want to use this info in an 'individual regression model' in which I will use the sum of heights/weights of individuals greater than the subject as an independent variable, to explain variation in other attributes.

Thanks for your help!

Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

I apologize . . . I haven't been able to look at this since Friday evening when I last responded. Any progress?

Solution
‎01-27-2014 02:37 PM
Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.

/* create some fake data */

data have ;

      do cluster = 1 to 10 ;

            do year = 2000 to 2010 ;

                  do measurement = 1 to 3 ;

                        do individual = 1 to 30 ;

                              height = ranuni(1234)*10;

                              weight = ranuni(1234)*100;

                              income = ranuni(1234)*100000;

                              output;

                        end;

                  end;

            end;

      end;

run;

/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest

      is greater than the individual's */

proc sql;

      create table want as

      select a.*, ( select sum ( b.weight )

                          from have b

                          where a.cluster=b.cluster

                              and a.year=b.year

                              and a.measurement=b.measurement

                              and a.weight < b.weight ) as weightabovesum

                        , ( select sum ( c.height )

                          from have c

                          where a.cluster=c.cluster

                              and a.year=c.year

                              and a.measurement=c.measurement

                              and a.height < c.height ) as heightabovesum

                        , ( select sum ( d.income )

                          from have d

                          where a.cluster=d.cluster

                              and a.year=d.year

                              and a.measurement=d.measurement

                              and a.income < d.income ) as incomeabovesum

      from have a

;

quit;

Occasional Contributor
Posts: 7

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Hi Fugue!

Thanks for both replies! I was going to try with a 'retain' statement. I am going to try with the code you posted, hopefull it will work with my data set, which has about 25,000 individual observations. Thank you very much again!

Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

Note the correction I made to the code I posted.

Occasional Contributor
Posts: 7

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Thank you very much, Fugue!! It worked well for my data set! Missing data shouldn't be a problem.

One more question. I have an additional classification variable (in the same cluster, year, and measurement), let's say ethnicity or race, and I want again to obtain the sum of weights, heights, etc for those individuals of ethnicity 'x' and individuals of ethnicity 'y' (separate) larger than my subject individual, is there any way to modify this code in order to do that?

I really appreciate your help!

Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

Did you get the correction to the code I posted earlier? The corrected code is:

proc sql;

      create table want as

      select a.*, ( select sum ( c.height )

                          from have c

                          where a.cluster=c.cluster

                              and a.year=c.year

                              and a.measurement=c.measurement

                              and a.height < c.height ) as heightabovesum

                        , ( select sum ( b.weight )

                          from have b

                          where a.cluster=b.cluster

                              and a.year=b.year

                              and a.measurement=b.measurement

                              and a.weight < b.weight ) as weightabovesum

                        , ( select sum ( d.income )

                          from have d

                          where a.cluster=d.cluster

                              and a.year=d.year

                              and a.measurement=d.measurement

                              and a.income < d.income ) as incomeabovesum

      from have a

;

quit;

Occasional Contributor
Posts: 7

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Yes, I did get it. Thank you!

Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

ok - needed to make sure because the first version had unnecessary individual < individual comparisons. Also, I had incorrectly referred to the height or weight (I can't remember which now) in the incomeabovesum block of code.

As for your question re: ethnicity, I'm not sure what you are asking. Can you elaborate? Perhaps provide a sample with how you expect the result to look . . .

Occasional Contributor
Posts: 7

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Original data looks like this:

cluster

YearMeasurementEthnicityIndividualheightincome
120001A1
120001A2
120001A3
120001B1
120001B2
120001B3
120001B4
120001B5

Resulting table should look like this:

ClusterYearMeasurementEthnicityIndividualheightweightincome

Sum of weights of individuals

of the ethnicity A larger

than the subject individual

Sum of weights of individuals

of the ethnicity B larger

than the subject individual

120001A1
120001A2
120001A3
120001A4
120001A5
120001B1
120001B2
120001B3
120001B4

It would be the same data, all cells populated. In this case, it would be the sum of all heights, weights, etc, of all individuals of ethnicity A and ethnicity B (separate) larger than the individual 1, 2, 3, etc, for that specifc measurement, year and cluster; that is, comparison of individual 1 with ethnicity A, in the measurement 1, year 2000 and cluster 1, with heights/weights, etc of individuals of ethnicity A and with individuals of ethnicity B fo that specific cluster, year and measurement. Does that make sense?

Thank you!

Super Contributor
Posts: 307

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Posted in reply to JosRois66

After adding an ethnicty variable with only two discrete values, my fake data is over 19,000 obs. With the added expressions, the query now takes more than 9 minutes on my Windows-based virtual machine. This is only partly due to the increase in the number of obs. The real issue in processing time is the number of passes through the data to handle each of the sub-queries for each individual . . .

proc sql;

      create table want as

      select a.*, ( select sum ( h1.height )

                          from have h1

                          where a.cluster=h1.cluster

                              and a.year=h1.year

                              and a.measurement=h1.measurement

                              and h1.ethnicity='A'

                              and a.height < h1.height ) as heightabovesum_A

            , ( select sum ( h2.height )

                          from have h2

                          where a.cluster=h2.cluster

                              and a.year=h2.year

                              and a.measurement=h2.measurement

                              and h2.ethnicity='B'

                              and a.height < h2.height ) as heightabovesum_B

          , ( select sum ( w1.weight )

                          from have w1

                          where a.cluster=w1.cluster

                              and a.year=w1.year

                              and a.measurement=w1.measurement

                              and w1.ethnicity='A'

                              and a.weight < w1.weight ) as weightabovesum_A

          , ( select sum ( w2.weight )

                          from have w2

                          where a.cluster=w2.cluster

                              and a.year=w2.year

                              and a.measurement=w2.measurement

                              and w2.ethnicity='B'

                              and a.weight < w2.weight ) as weightabovesum_B

         , ( select sum ( i1.income )

                          from have i1

                          where a.cluster=i1.cluster

                              and a.year=i1.year

                              and a.measurement=i1.measurement

                              and i1.ethnicity='A'

                              and a.income < i1.income ) as incomeabovesum_A

           , ( select sum ( i2.income )

                          from have i2

                          where a.cluster=i2.cluster

                              and a.year=i2.year

                              and a.measurement=i2.measurement

                              and i2.ethnicity='B'

                              and a.income < i2.income ) as incomeabovesum_B

      from have a

;

quit;

Occasional Contributor
Posts: 7

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Many, many thanks, Fugue!! It worked very well with my data! I have over 25000 observations, and actually I just have (for now), 2 ethnicities as well! But in the future I will have at least 4 more and more observations. This query did not take too long to run in my machine (about 3 minutes), which is also windows-based but in mac (partition), so I assume that will take a bit longer with more ethnicity groups but it's fine!

Thank you again!

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 13 replies
  • 565 views
  • 2 likes
  • 2 in conversation