BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
JosRois66
Calcite | Level 5

Hello,

Let's say that I have many individuals clustered (being the cluster a house, clinic, plot, etc). Each individual has an identification number, all individuals have characteristics such as weight, height, sex, income, etc, and they all have been measured multiple times. I want to calculate the sum of weights, sum of heights,  sum of income, etc, for ONLY all those individuals that are equal, greater, or smaller THAN an specific subject in the cluster or group; the subject individual is the one in the indicated row. How can I do that in the most accurate and efficient way?

The original table with data looks like this one:

clusteryearmeasurementindividual #heightweightincomestatus
12000111.6070L
21.7070L
31.8070L

The resulting table should look more or less like the one below.

clusteryearmeasurementindividual #heightweightincomestatus

heights of individuals

larger than the subject individual

weight of individuals heavier

than the subject individual

income of individuals

wealthier than the subject

12000111.6070L

I really appreciate any help that I can get; This might seem very simple for many, but I have not been able to solve this. Thanks for reading.

VR

1 ACCEPTED SOLUTION

Accepted Solutions
Fugue
Quartz | Level 8

How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.

/* create some fake data */

data have ;

      do cluster = 1 to 10 ;

            do year = 2000 to 2010 ;

                  do measurement = 1 to 3 ;

                        do individual = 1 to 30 ;

                              height = ranuni(1234)*10;

                              weight = ranuni(1234)*100;

                              income = ranuni(1234)*100000;

                              output;

                        end;

                  end;

            end;

      end;

run;

/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest

      is greater than the individual's */

proc sql;

      create table want as

      select a.*, ( select sum ( b.weight )

                          from have b

                          where a.cluster=b.cluster

                              and a.year=b.year

                              and a.measurement=b.measurement

                              and a.weight < b.weight ) as weightabovesum

                        , ( select sum ( c.height )

                          from have c

                          where a.cluster=c.cluster

                              and a.year=c.year

                              and a.measurement=c.measurement

                              and a.height < c.height ) as heightabovesum

                        , ( select sum ( d.income )

                          from have d

                          where a.cluster=d.cluster

                              and a.year=d.year

                              and a.measurement=d.measurement

                              and a.income < d.income ) as incomeabovesum

      from have a

;

quit;

View solution in original post

13 REPLIES 13
Fugue
Quartz | Level 8

Question 1: Are the variables (columns) CLUSTER, YEAR and MEASUREMENT populated for every observation (row) in your data? The sample you have provided suggests not??

Question 2: Are you adding the relevant vars (HEIGHT, WEIGHT, INCOME) only for unique individuals at each crossing of CLUSTER, YEAR and MEASUREMENT? For example, in the sample output you've provided, do we only want individuals from cluster=1, year=2000 and measurement=1 that are greater than the specified individual?

Question 3: Are you wanting to do a many-to-many analysis (e.g. all unique crossings compared to all other matched crossings), or a many-to-one analysis (e.g. all matched crossings compared to one selected unique observation)?

JosRois66
Calcite | Level 5

Hi Fugue,

1. All variables (cluster, year, measurement) are populated for every individual (row) in the data set, I just did not do it in the example.

2. Yes, I only want individuals from cluster=1, year=2000, and measurement=1 that are greater than the specified individual.

3. I would like to do both, a many-to-many and a many-to-one; however, I want to use this info in an 'individual regression model' in which I will use the sum of heights/weights of individuals greater than the subject as an independent variable, to explain variation in other attributes.

Thanks for your help!

Fugue
Quartz | Level 8

I apologize . . . I haven't been able to look at this since Friday evening when I last responded. Any progress?

Fugue
Quartz | Level 8

How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.

/* create some fake data */

data have ;

      do cluster = 1 to 10 ;

            do year = 2000 to 2010 ;

                  do measurement = 1 to 3 ;

                        do individual = 1 to 30 ;

                              height = ranuni(1234)*10;

                              weight = ranuni(1234)*100;

                              income = ranuni(1234)*100000;

                              output;

                        end;

                  end;

            end;

      end;

run;

/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest

      is greater than the individual's */

proc sql;

      create table want as

      select a.*, ( select sum ( b.weight )

                          from have b

                          where a.cluster=b.cluster

                              and a.year=b.year

                              and a.measurement=b.measurement

                              and a.weight < b.weight ) as weightabovesum

                        , ( select sum ( c.height )

                          from have c

                          where a.cluster=c.cluster

                              and a.year=c.year

                              and a.measurement=c.measurement

                              and a.height < c.height ) as heightabovesum

                        , ( select sum ( d.income )

                          from have d

                          where a.cluster=d.cluster

                              and a.year=d.year

                              and a.measurement=d.measurement

                              and a.income < d.income ) as incomeabovesum

      from have a

;

quit;

JosRois66
Calcite | Level 5

Hi Fugue!

Thanks for both replies! I was going to try with a 'retain' statement. I am going to try with the code you posted, hopefull it will work with my data set, which has about 25,000 individual observations. Thank you very much again!

JosRois66
Calcite | Level 5

Thank you very much, Fugue!! It worked well for my data set! Missing data shouldn't be a problem.

One more question. I have an additional classification variable (in the same cluster, year, and measurement), let's say ethnicity or race, and I want again to obtain the sum of weights, heights, etc for those individuals of ethnicity 'x' and individuals of ethnicity 'y' (separate) larger than my subject individual, is there any way to modify this code in order to do that?

I really appreciate your help!

Fugue
Quartz | Level 8

Did you get the correction to the code I posted earlier? The corrected code is:

proc sql;

      create table want as

      select a.*, ( select sum ( c.height )

                          from have c

                          where a.cluster=c.cluster

                              and a.year=c.year

                              and a.measurement=c.measurement

                              and a.height < c.height ) as heightabovesum

                        , ( select sum ( b.weight )

                          from have b

                          where a.cluster=b.cluster

                              and a.year=b.year

                              and a.measurement=b.measurement

                              and a.weight < b.weight ) as weightabovesum

                        , ( select sum ( d.income )

                          from have d

                          where a.cluster=d.cluster

                              and a.year=d.year

                              and a.measurement=d.measurement

                              and a.income < d.income ) as incomeabovesum

      from have a

;

quit;

Fugue
Quartz | Level 8

ok - needed to make sure because the first version had unnecessary individual < individual comparisons. Also, I had incorrectly referred to the height or weight (I can't remember which now) in the incomeabovesum block of code.

As for your question re: ethnicity, I'm not sure what you are asking. Can you elaborate? Perhaps provide a sample with how you expect the result to look . . .

JosRois66
Calcite | Level 5

Original data looks like this:

cluster

YearMeasurementEthnicityIndividualheightincome
120001A1
120001A2
120001A3
120001B1
120001B2
120001B3
120001B4
120001B5

Resulting table should look like this:

ClusterYearMeasurementEthnicityIndividualheightweightincome

Sum of weights of individuals

of the ethnicity A larger

than the subject individual

Sum of weights of individuals

of the ethnicity B larger

than the subject individual

120001A1
120001A2
120001A3
120001A4
120001A5
120001B1
120001B2
120001B3
120001B4

It would be the same data, all cells populated. In this case, it would be the sum of all heights, weights, etc, of all individuals of ethnicity A and ethnicity B (separate) larger than the individual 1, 2, 3, etc, for that specifc measurement, year and cluster; that is, comparison of individual 1 with ethnicity A, in the measurement 1, year 2000 and cluster 1, with heights/weights, etc of individuals of ethnicity A and with individuals of ethnicity B fo that specific cluster, year and measurement. Does that make sense?

Thank you!

Fugue
Quartz | Level 8

After adding an ethnicty variable with only two discrete values, my fake data is over 19,000 obs. With the added expressions, the query now takes more than 9 minutes on my Windows-based virtual machine. This is only partly due to the increase in the number of obs. The real issue in processing time is the number of passes through the data to handle each of the sub-queries for each individual . . .

proc sql;

      create table want as

      select a.*, ( select sum ( h1.height )

                          from have h1

                          where a.cluster=h1.cluster

                              and a.year=h1.year

                              and a.measurement=h1.measurement

                              and h1.ethnicity='A'

                              and a.height < h1.height ) as heightabovesum_A

            , ( select sum ( h2.height )

                          from have h2

                          where a.cluster=h2.cluster

                              and a.year=h2.year

                              and a.measurement=h2.measurement

                              and h2.ethnicity='B'

                              and a.height < h2.height ) as heightabovesum_B

          , ( select sum ( w1.weight )

                          from have w1

                          where a.cluster=w1.cluster

                              and a.year=w1.year

                              and a.measurement=w1.measurement

                              and w1.ethnicity='A'

                              and a.weight < w1.weight ) as weightabovesum_A

          , ( select sum ( w2.weight )

                          from have w2

                          where a.cluster=w2.cluster

                              and a.year=w2.year

                              and a.measurement=w2.measurement

                              and w2.ethnicity='B'

                              and a.weight < w2.weight ) as weightabovesum_B

         , ( select sum ( i1.income )

                          from have i1

                          where a.cluster=i1.cluster

                              and a.year=i1.year

                              and a.measurement=i1.measurement

                              and i1.ethnicity='A'

                              and a.income < i1.income ) as incomeabovesum_A

           , ( select sum ( i2.income )

                          from have i2

                          where a.cluster=i2.cluster

                              and a.year=i2.year

                              and a.measurement=i2.measurement

                              and i2.ethnicity='B'

                              and a.income < i2.income ) as incomeabovesum_B

      from have a

;

quit;

JosRois66
Calcite | Level 5

Many, many thanks, Fugue!! It worked very well with my data! I have over 25000 observations, and actually I just have (for now), 2 ethnicities as well! But in the future I will have at least 4 more and more observations. This query did not take too long to run in my machine (about 3 minutes), which is also windows-based but in mac (partition), so I assume that will take a bit longer with more ethnicity groups but it's fine!

Thank you again!

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 1501 views
  • 2 likes
  • 2 in conversation