Re: Splitting dataset into two groups

sara_a · Posted 08-21-2015 03:50 PM

Hi,

I have two separate datasets that I would like to compare. I concatenated them so that I could run t-tests and chi-square tests, but I'm not sure how to split the new dataset into two groups. Neither group has any distinguishing feature; each observation only has a different ID number.

Reeza · Posted 08-21-2015 04:17 PM

So what differentiates the data? The source data sets? If so, use the INDSNAME= option on the SET statement to identify the source when appending.

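data want;
    set data1 data2 indsname=source;  /* source holds the name of the contributing data set */
    indata=source;                    /* the INDSNAME= variable is not written to the output, so copy it */
run;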

sara_a · Posted 08-21-2015 04:22 PM

Hi Reeza,

So, basically there was a larger dataset initially, and a random sample was taken from it. This random sample has 70 people. I want to compare features of these 70 people with features of the observations that weren't randomly selected (n=472) to assess representativeness. Does that make more sense?

Thanks.

Reeza · Posted 08-22-2015 12:02 PM

That's a standard comparison: is the sample similar to the 'population'?

The method above will identify the source of each observation, and you can then use that variable as the class variable for the comparison.

data want;
    set pop sample indsname=source;
    datain=source;
run;

proc freq data=want;
    table datain*<variable of interest>/chisq;
run;
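
For the t-test side of the comparison, the same variable can go on a CLASS statement in PROC TTEST. This is only a minimal sketch: `age` is a hypothetical continuous variable standing in for whatever measure is of interest, and it assumes `datain` takes exactly two values (one per source data set), which PROC TTEST requires of a CLASS variable.

```sas
/* Two-sample t-test comparing the sample with the remaining observations. */
/* age is a hypothetical variable name; substitute the real one.           */
proc ttest data=want;
    class datain;   /* grouping variable built from INDSNAME= above */
    var age;        /* continuous variable to compare between groups */
run;
```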

ballardw · Posted 08-21-2015 04:39 PM

One would hope that the original datasets, or source files to recreate the data sets, still exist. If the original data sets before concatenation no longer exist it may be that re-reading the source data files would be the best option.
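
If only the concatenated data set remains but a list of the sampled IDs is still available, group membership could also be rebuilt from the IDs. The sketch below is an assumption-laden illustration, not something taken from the original data: `combined`, `sample_ids`, `id`, and `grp` are all hypothetical names.

```sas
/* Flag each observation by whether its ID appears in the list of sampled IDs. */
/* combined, sample_ids, id and grp are hypothetical names for illustration.   */
proc sql;
    create table want as
    select *,
           case when id in (select id from sample_ids)
                then 'sample' else 'rest' end as grp
    from combined;
quit;
```

The resulting `grp` variable could then serve as the grouping variable in PROC FREQ or PROC TTEST.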
