Hello,
Let's say that I have many individuals clustered (being the cluster a house, clinic, plot, etc). Each individual has an identification number, all individuals have characteristics such as weight, height, sex, income, etc, and they all have been measured multiple times. I want to calculate the sum of weights, sum of heights, sum of income, etc, for ONLY all those individuals that are equal, greater, or smaller THAN an specific subject in the cluster or group; the subject individual is the one in the indicated row. How can I do that in the most accurate and efficient way?
The original table with data looks like this one:
cluster | year | measurement | individual # | height | weight | income | status |
---|---|---|---|---|---|---|---|
1 | 2000 | 1 | 1 | 1.60 | 70 | L | |
2 | 1.70 | 70 | L | ||||
3 | 1.80 | 70 | L |
The resulting table should look more or less like the one below.
cluster | year | measurement | individual # | height | weight | income | status | heights of individuals larger than the subject individual | weight of individuals heavier than the subject individual | income of individuals wealthier than the subject |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2000 | 1 | 1 | 1.60 | 70 | L |
I really appreciate any help that I can get; This might seem very simple for many, but I have not been able to solve this. Thanks for reading.
VR
How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.
/* create some fake data */
data have ;
do cluster = 1 to 10 ;
do year = 2000 to 2010 ;
do measurement = 1 to 3 ;
do individual = 1 to 30 ;
height = ranuni(1234)*10;
weight = ranuni(1234)*100;
income = ranuni(1234)*100000;
output;
end;
end;
end;
end;
run;
/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest
is greater than the individual's */
proc sql;
create table want as
select a.*, ( select sum ( b.weight )
from have b
where a.cluster=b.cluster
and a.year=b.year
and a.measurement=b.measurement
and a.weight < b.weight ) as weightabovesum
, ( select sum ( c.height )
from have c
where a.cluster=c.cluster
and a.year=c.year
and a.measurement=c.measurement
and a.height < c.height ) as heightabovesum
, ( select sum ( d.income )
from have d
where a.cluster=d.cluster
and a.year=d.year
and a.measurement=d.measurement
and a.income < d.income ) as incomeabovesum
from have a
;
quit;
Question 1: Are the variables (columns) CLUSTER, YEAR and MEASUREMENT populated for every observation (row) in your data? The sample you have provided suggests not??
Question 2: Are you adding the relevant vars (HEIGHT, WEIGHT, INCOME) only for unique individuals at each crossing of CLUSTER, YEAR and MEASUREMENT? For example, in the sample output you've provided, do we only want individuals from cluster=1, year=2000 and measurement=1 that are greater than the specified individual?
Question 3: Are you wanting to do a many-to-many analysis (e.g. all unique crossings compared to all other matched crossings), or a many-to-one analysis (e.g. all matched crossings compared to one selected unique observation)?
Hi Fugue,
1. All variables (cluster, year, measurement) are populated for every individual (row) in the data set, I just did not do it in the example.
2. Yes, I only want individuals from cluster=1, year=2000, and measurement=1 that are greater than the specified individual.
3. I would like to do both, a many-to-many and a many-to-one; however, I want to use this info in an 'individual regression model' in which I will use the sum of heights/weights of individuals greater than the subject as an independent variable, to explain variation in other attributes.
Thanks for your help!
I apologize . . . I haven't been able to look at this since Friday evening when I last responded. Any progress?
How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.
/* create some fake data */
data have ;
do cluster = 1 to 10 ;
do year = 2000 to 2010 ;
do measurement = 1 to 3 ;
do individual = 1 to 30 ;
height = ranuni(1234)*10;
weight = ranuni(1234)*100;
income = ranuni(1234)*100000;
output;
end;
end;
end;
end;
run;
/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest
is greater than the individual's */
proc sql;
create table want as
select a.*, ( select sum ( b.weight )
from have b
where a.cluster=b.cluster
and a.year=b.year
and a.measurement=b.measurement
and a.weight < b.weight ) as weightabovesum
, ( select sum ( c.height )
from have c
where a.cluster=c.cluster
and a.year=c.year
and a.measurement=c.measurement
and a.height < c.height ) as heightabovesum
, ( select sum ( d.income )
from have d
where a.cluster=d.cluster
and a.year=d.year
and a.measurement=d.measurement
and a.income < d.income ) as incomeabovesum
from have a
;
quit;
Hi Fugue!
Thanks for both replies! I was going to try with a 'retain' statement. I am going to try with the code you posted, hopefull it will work with my data set, which has about 25,000 individual observations. Thank you very much again!
Note the correction I made to the code I posted.
Thank you very much, Fugue!! It worked well for my data set! Missing data shouldn't be a problem.
One more question. I have an additional classification variable (in the same cluster, year, and measurement), let's say ethnicity or race, and I want again to obtain the sum of weights, heights, etc for those individuals of ethnicity 'x' and individuals of ethnicity 'y' (separate) larger than my subject individual, is there any way to modify this code in order to do that?
I really appreciate your help!
Did you get the correction to the code I posted earlier? The corrected code is:
proc sql;
create table want as
select a.*, ( select sum ( c.height )
from have c
where a.cluster=c.cluster
and a.year=c.year
and a.measurement=c.measurement
and a.height < c.height ) as heightabovesum
, ( select sum ( b.weight )
from have b
where a.cluster=b.cluster
and a.year=b.year
and a.measurement=b.measurement
and a.weight < b.weight ) as weightabovesum
, ( select sum ( d.income )
from have d
where a.cluster=d.cluster
and a.year=d.year
and a.measurement=d.measurement
and a.income < d.income ) as incomeabovesum
from have a
;
quit;
Yes, I did get it. Thank you!
ok - needed to make sure because the first version had unnecessary individual < individual comparisons. Also, I had incorrectly referred to the height or weight (I can't remember which now) in the incomeabovesum block of code.
As for your question re: ethnicity, I'm not sure what you are asking. Can you elaborate? Perhaps provide a sample with how you expect the result to look . . .
Original data looks like this:
cluster | Year | Measurement | Ethnicity | Individual | height | income |
---|---|---|---|---|---|---|
1 | 2000 | 1 | A | 1 | ||
1 | 2000 | 1 | A | 2 | ||
1 | 2000 | 1 | A | 3 | ||
1 | 2000 | 1 | B | 1 | ||
1 | 2000 | 1 | B | 2 | ||
1 | 2000 | 1 | B | 3 | ||
1 | 2000 | 1 | B | 4 | ||
1 | 2000 | 1 | B | 5 |
Resulting table should look like this:
Cluster | Year | Measurement | Ethnicity | Individual | height | weight | income | Sum of weights of individuals of the ethnicity A larger than the subject individual | Sum of weights of individuals of the ethnicity B larger than the subject individual |
---|---|---|---|---|---|---|---|---|---|
1 | 2000 | 1 | A | 1 | |||||
1 | 2000 | 1 | A | 2 | |||||
1 | 2000 | 1 | A | 3 | |||||
1 | 2000 | 1 | A | 4 | |||||
1 | 2000 | 1 | A | 5 | |||||
1 | 2000 | 1 | B | 1 | |||||
1 | 2000 | 1 | B | 2 | |||||
1 | 2000 | 1 | B | 3 | |||||
1 | 2000 | 1 | B | 4 | |||||
It would be the same data, all cells populated. In this case, it would be the sum of all heights, weights, etc, of all individuals of ethnicity A and ethnicity B (separate) larger than the individual 1, 2, 3, etc, for that specifc measurement, year and cluster; that is, comparison of individual 1 with ethnicity A, in the measurement 1, year 2000 and cluster 1, with heights/weights, etc of individuals of ethnicity A and with individuals of ethnicity B fo that specific cluster, year and measurement. Does that make sense?
Thank you!
After adding an ethnicty variable with only two discrete values, my fake data is over 19,000 obs. With the added expressions, the query now takes more than 9 minutes on my Windows-based virtual machine. This is only partly due to the increase in the number of obs. The real issue in processing time is the number of passes through the data to handle each of the sub-queries for each individual . . .
proc sql;
create table want as
select a.*, ( select sum ( h1.height )
from have h1
where a.cluster=h1.cluster
and a.year=h1.year
and a.measurement=h1.measurement
and h1.ethnicity='A'
and a.height < h1.height ) as heightabovesum_A
, ( select sum ( h2.height )
from have h2
where a.cluster=h2.cluster
and a.year=h2.year
and a.measurement=h2.measurement
and h2.ethnicity='B'
and a.height < h2.height ) as heightabovesum_B
, ( select sum ( w1.weight )
from have w1
where a.cluster=w1.cluster
and a.year=w1.year
and a.measurement=w1.measurement
and w1.ethnicity='A'
and a.weight < w1.weight ) as weightabovesum_A
, ( select sum ( w2.weight )
from have w2
where a.cluster=w2.cluster
and a.year=w2.year
and a.measurement=w2.measurement
and w2.ethnicity='B'
and a.weight < w2.weight ) as weightabovesum_B
, ( select sum ( i1.income )
from have i1
where a.cluster=i1.cluster
and a.year=i1.year
and a.measurement=i1.measurement
and i1.ethnicity='A'
and a.income < i1.income ) as incomeabovesum_A
, ( select sum ( i2.income )
from have i2
where a.cluster=i2.cluster
and a.year=i2.year
and a.measurement=i2.measurement
and i2.ethnicity='B'
and a.income < i2.income ) as incomeabovesum_B
from have a
;
quit;
Many, many thanks, Fugue!! It worked very well with my data! I have over 25000 observations, and actually I just have (for now), 2 ethnicities as well! But in the future I will have at least 4 more and more observations. This query did not take too long to run in my machine (about 3 minutes), which is also windows-based but in mac (partition), so I assume that will take a bit longer with more ethnicity groups but it's fine!
Thank you again!
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.