Solved: How to do calculations for a group of individuals smaller/larger than ...

JosRois66 · Posted 01-24-2014 08:06 PM

Hello,

Let's say that I have many individuals clustered (being the cluster a house, clinic, plot, etc). Each individual has an identification number, all individuals have characteristics such as weight, height, sex, income, etc, and they all have been measured multiple times. I want to calculate the sum of weights, sum of heights, sum of income, etc, for ONLY all those individuals that are equal, greater, or smaller THAN an specific subject in the cluster or group; the subject individual is the one in the indicated row. How can I do that in the most accurate and efficient way?

The original table with data looks like this one:

cluster	year	measurement	individual #	height	weight	status
1	2000	1	1	1.60	70	L
			2	1.70	70	L
			3	1.80	70	L

The resulting table should look more or less like the one below.

cluster

year

measurement

individual #

height

weight

income

status

heights of individuals

larger than the subject individual

weight of individuals heavier

than the subject individual

income of individuals

wealthier than the subject

1

2000

1

1.60

70

L

I really appreciate any help that I can get; This might seem very simple for many, but I have not been able to solve this. Thanks for reading.

VR

Fugue · Posted 01-27-2014 02:37 PM

How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.

/* create some fake data */

data have ;

do cluster = 1 to 10 ;

do year = 2000 to 2010 ;

do measurement = 1 to 3 ;

do individual = 1 to 30 ;

height = ranuni(1234)*10;

weight = ranuni(1234)*100;

income = ranuni(1234)*100000;

output;

end;

run;

/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest

is greater than the individual's */

proc sql;

create table want as

select a.*, ( select sum ( b.weight )

from have b

where a.cluster=b.cluster

and a.year=b.year

and a.measurement=b.measurement

and a.weight < b.weight ) as weightabovesum

, ( select sum ( c.height )

from have c

where a.cluster=c.cluster

and a.year=c.year

and a.measurement=c.measurement

and a.height < c.height ) as heightabovesum

, ( select sum ( d.income )

from have d

where a.cluster=d.cluster

and a.year=d.year

and a.measurement=d.measurement

and a.income < d.income ) as incomeabovesum

from have a

;

quit;

View solution in original post

Fugue · Posted 01-24-2014 08:19 PM

Question 1: Are the variables (columns) CLUSTER, YEAR and MEASUREMENT populated for every observation (row) in your data? The sample you have provided suggests not??

Question 2: Are you adding the relevant vars (HEIGHT, WEIGHT, INCOME) only for unique individuals at each crossing of CLUSTER, YEAR and MEASUREMENT? For example, in the sample output you've provided, do we only want individuals from cluster=1, year=2000 and measurement=1 that are greater than the specified individual?

Question 3: Are you wanting to do a many-to-many analysis (e.g. all unique crossings compared to all other matched crossings), or a many-to-one analysis (e.g. all matched crossings compared to one selected unique observation)?

JosRois66 · Posted 01-24-2014 08:33 PM

Hi Fugue,

1. All variables (cluster, year, measurement) are populated for every individual (row) in the data set, I just did not do it in the example.

2. Yes, I only want individuals from cluster=1, year=2000, and measurement=1 that are greater than the specified individual.

3. I would like to do both, a many-to-many and a many-to-one; however, I want to use this info in an 'individual regression model' in which I will use the sum of heights/weights of individuals greater than the subject as an independent variable, to explain variation in other attributes.

Thanks for your help!

Fugue · Posted 01-27-2014 01:47 PM

I apologize . . . I haven't been able to look at this since Friday evening when I last responded. Any progress?

Fugue · Posted 01-27-2014 02:37 PM

How many observations in your dataset? One option is to use PROC SQL, but that will be inefficient for large datasets. You could use the following PROC SQL if you don't have tons of data. Note you will also have to decide how to deal with missing values. Also, the sums (when present) may be based on a very small number of observations and that may mess with your analysis.

/* create some fake data */

data have ;

do cluster = 1 to 10 ;

do year = 2000 to 2010 ;

do measurement = 1 to 3 ;

do individual = 1 to 30 ;

height = ranuni(1234)*10;

weight = ranuni(1234)*100;

income = ranuni(1234)*100000;

output;

end;

run;

/* sum analysis variables for all other individuals in same cluster-year-measurement crossings where the variable of interest

is greater than the individual's */

proc sql;

create table want as

select a.*, ( select sum ( b.weight )

from have b

where a.cluster=b.cluster

and a.year=b.year

and a.measurement=b.measurement

and a.weight < b.weight ) as weightabovesum

, ( select sum ( c.height )

from have c

where a.cluster=c.cluster

and a.year=c.year

and a.measurement=c.measurement

and a.height < c.height ) as heightabovesum

, ( select sum ( d.income )

from have d

where a.cluster=d.cluster

and a.year=d.year

and a.measurement=d.measurement

and a.income < d.income ) as incomeabovesum

from have a

;

quit;

JosRois66 · Posted 01-27-2014 03:45 PM

Hi Fugue!

Thanks for both replies! I was going to try with a 'retain' statement. I am going to try with the code you posted, hopefull it will work with my data set, which has about 25,000 individual observations. Thank you very much again!

Fugue · Posted 01-27-2014 04:24 PM

Note the correction I made to the code I posted.

JosRois66 · Posted 01-27-2014 06:39 PM

Thank you very much, Fugue!! It worked well for my data set! Missing data shouldn't be a problem.

One more question. I have an additional classification variable (in the same cluster, year, and measurement), let's say ethnicity or race, and I want again to obtain the sum of weights, heights, etc for those individuals of ethnicity 'x' and individuals of ethnicity 'y' (separate) larger than my subject individual, is there any way to modify this code in order to do that?

I really appreciate your help!

Fugue · Posted 01-27-2014 06:54 PM

Did you get the correction to the code I posted earlier? The corrected code is:

proc sql;

create table want as

select a.*, ( select sum ( c.height )

from have c

where a.cluster=c.cluster

and a.year=c.year

and a.measurement=c.measurement

and a.height < c.height ) as heightabovesum

, ( select sum ( b.weight )

from have b

where a.cluster=b.cluster

and a.year=b.year

and a.measurement=b.measurement

and a.weight < b.weight ) as weightabovesum

, ( select sum ( d.income )

from have d

where a.cluster=d.cluster

and a.year=d.year

and a.measurement=d.measurement

and a.income < d.income ) as incomeabovesum

from have a

;

quit;

JosRois66 · Posted 01-27-2014 07:04 PM

Yes, I did get it. Thank you!

Fugue · Posted 01-27-2014 07:10 PM

ok - needed to make sure because the first version had unnecessary individual < individual comparisons. Also, I had incorrectly referred to the height or weight (I can't remember which now) in the incomeabovesum block of code.

As for your question re: ethnicity, I'm not sure what you are asking. Can you elaborate? Perhaps provide a sample with how you expect the result to look . . .

JosRois66 · Posted 01-27-2014 07:39 PM

Original data looks like this:

cluster	Year	Measurement	Ethnicity	Individual
1	2000	1	A	1
1	2000	1	A	2
1	2000	1	A	3
1	2000	1	B	1
1	2000	1	B	2
1	2000	1	B	3
1	2000	1	B	4
1	2000	1	B	5

Resulting table should look like this:

Cluster	Year	Measurement	Ethnicity	Individual
1	2000	1	A	1
1	2000	1	A	2
1	2000	1	A	3
1	2000	1	A	4
1	2000	1	A	5
1	2000	1	B	1
1	2000	1	B	2
1	2000	1	B	3
1	2000	1	B	4

It would be the same data, all cells populated. In this case, it would be the sum of all heights, weights, etc, of all individuals of ethnicity A and ethnicity B (separate) larger than the individual 1, 2, 3, etc, for that specifc measurement, year and cluster; that is, comparison of individual 1 with ethnicity A, in the measurement 1, year 2000 and cluster 1, with heights/weights, etc of individuals of ethnicity A and with individuals of ethnicity B fo that specific cluster, year and measurement. Does that make sense?

Thank you!

Fugue · Posted 01-27-2014 10:03 PM

After adding an ethnicty variable with only two discrete values, my fake data is over 19,000 obs. With the added expressions, the query now takes more than 9 minutes on my Windows-based virtual machine. This is only partly due to the increase in the number of obs. The real issue in processing time is the number of passes through the data to handle each of the sub-queries for each individual . . .

proc sql;

create table want as

select a.*, ( select sum ( h1.height )

from have h1

where a.cluster=h1.cluster

and a.year=h1.year

and a.measurement=h1.measurement

and h1.ethnicity='A'

and a.height < h1.height ) as heightabovesum_A

, ( select sum ( h2.height )

from have h2

where a.cluster=h2.cluster

and a.year=h2.year

and a.measurement=h2.measurement

and h2.ethnicity='B'

and a.height < h2.height ) as heightabovesum_B

, ( select sum ( w1.weight )

from have w1

where a.cluster=w1.cluster

and a.year=w1.year

and a.measurement=w1.measurement

and w1.ethnicity='A'

and a.weight < w1.weight ) as weightabovesum_A

, ( select sum ( w2.weight )

from have w2

where a.cluster=w2.cluster

and a.year=w2.year

and a.measurement=w2.measurement

and w2.ethnicity='B'

and a.weight < w2.weight ) as weightabovesum_B

, ( select sum ( i1.income )

from have i1

where a.cluster=i1.cluster

and a.year=i1.year

and a.measurement=i1.measurement

and i1.ethnicity='A'

and a.income < i1.income ) as incomeabovesum_A

, ( select sum ( i2.income )

from have i2

where a.cluster=i2.cluster

and a.year=i2.year

and a.measurement=i2.measurement

and i2.ethnicity='B'

and a.income < i2.income ) as incomeabovesum_B

from have a

;

quit;

JosRois66 · Posted 01-27-2014 10:29 PM

Many, many thanks, Fugue!! It worked very well with my data! I have over 25000 observations, and actually I just have (for now), 2 ethnicities as well! But in the future I will have at least 4 more and more observations. This query did not take too long to run in my machine (about 3 minutes), which is also windows-based but in mac (partition), so I assume that will take a bit longer with more ethnicity groups but it's fine!

Thank you again!

How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Re: How to do calculations for a group of individuals smaller/larger than an specific subject in the cluster?

Registration is open