Solved: grouping records

michtka · Posted 09-04-2012 11:30 AM

Hi everyone, I need to classify every record of the activity of test2,
in groups...1, 2 , 3. Which group? The group number is given in test1.

I give you some a couple of examles to explain the problem:

       ...the first record in test2 of subjid=1 (activity=15)
       has to be assigned to the group=1, because 15 for subjid=1 is
       between 10 and 20 beloging to group 1 and so on.

       the first record in test2 of subjid=2 (activity=03)
       has to be missing (group=.) because there is not group below 5 assigned in test1
       to subjid=2

In my problem I have 40 subjid, I only show 2 to simplify the problem,
andI need you to help me with a general code (using matrix I suppose). Thanks a lot. V

       data test1;
       input subjid awake sleep group;
       datalines;
       1 10 20 1
       1 30 40 2
       1 50 . 1
       2 05 10 1
       2 15 25 2
       2 28 42 3
       2 53 60 2
       ;
       run;

       data test2;
       input subjid activity;
       datalines;
       1 15
       1 25
       1 35
       1 39
       1 51
       2 03
       2 07
       2 16
       2 43
       2 55
       2 63
       ;
       run;

Haikuo · Posted 09-04-2012 11:55 AM

Hi,

You didn't say what to do with the missing value, so this approach is not factoring in missing values in variable 'awake' or 'sleep':

	proc sql;
	select distinct a.*, group from test2 a
	left join test1 b
	on ( awake<= activity<= sleep) and a.subjid=b.subjid order by subjid, activity;
	quit;

Haikuo

View solution in original post

Haikuo · Posted 09-04-2012 11:55 AM

Hi,

You didn't say what to do with the missing value, so this approach is not factoring in missing values in variable 'awake' or 'sleep':

	proc sql;
	select distinct a.*, group from test2 a
	left join test1 b
	on ( awake<= activity<= sleep) and a.subjid=b.subjid order by subjid, activity;
	quit;

Haikuo

michtka · Posted 09-04-2012 12:16 PM

Thnaks Hai. Kuo, it works perfectly, I thought i will need arrays, but with sql you did it right. Thnaks a lot.

DanielSantos · Posted 09-04-2012 01:01 PM

Array's aren't really needed, and merge probably won't work.

Here's another solution using datastep, conditionally iterating through test1 (test2 would be the driver table).

data want (drop=sleep awake _:); * drop limits and temp vars;

set test2 ;

if subjid ne _subjid or activity > sleep then

set test1 (rename=(subjid=_subjid group=_group)); * verify match;

if activity < awake then group=.; else group=_group; * set desired group;

run;

This assumes that tables were previously ordered by subjid,activity and subjid,awake,sleep has shown upside.

If you're dealing with a large volumes, I believe this would be a good performance wise code.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Haikuo · Posted 09-04-2012 01:45 PM

Hi,

I don't think your code works the way that OP wanted. Remember that the implied data step loop only goes one way, unless you loop through using point=. While if you do so, then the efficiency will get down to the SQL level since they are doing the identical thing: making Cartesian products. So efficiency wise, hash() is an option.

/*point=*/

data want (keep=subjid activity group);

set test2 (rename=subjid=_subjid);

do n=1 to nobs ;

set test1 point=n nobs=nobs;

if subjid=_subjid and awake<=activity<=sleep then do; output; leave;end;

else if n=nobs then do; call missing(group); output;end;

end;

run;

/* better to hash() it*/

data want (keep=subjid activity group);

if _n_=1 then do;

if 0 then set test1;

dcl hash h(dataset:'test1', multidata:'y');

h.definekey('subjid');

h.definedata(all:'y');

h.definedone();

end;

set test2;

rc=h.find();

do rc=0 by 0 while (rc=0);

if awake<=activity<=sleep then do; output; leave;end;

rc=h.find_next();

if rc ne 0 then do; call missing(group); output;end;

end;

run;

Haikuo

mkeintz · Posted 09-04-2012 03:03 PM

Even in this case, I would avoid a hash. Whenever I have a choice between regular arrays and hash tables, I almost always find the arrays more straightforward, and they probably are more efficient at large scales.

Here's a two-step, (but one-pass solution) that uses arrays and a merge statement. It assumes only that all values of activity are integers in the range of 1 through 1000 (easily modified):

data vgrps (keep=subjid _grp:) /view=vgrps;

array _grp {1000};

do until (last.subjid);

set test1;

by subjid;

do _N_=coalesce(awake,sleep) to coalesce(sleep,awake);

_grp{_N_}=group;

end;

run;

data want (drop=_:);

array _grp {1000};

merge vgrps test2 (in=in2);

by subjid;

if in2 then do;

group=_grp{activity};

output;

end;

run;

Now, if you could be certain that every id appears in both test1 and test2, you could collapse this into a single step, but I wouldn't recommend it.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

DanielSantos · Posted 09-05-2012 06:25 AM

Haikuo, forget the previous code, from michtka last example everything is much clear.

Again, if tables are ordered by subjid,activity and subjid,awake,sleep and there's no overlapping intervals (has shown upside) by simply conditionally iterating through test1 (test2 would be the driver table) you can have the job done. No need for loop iterations.

data want (drop=sleep awake _:) ; * drop limits and temp vars;

set test2 ;

retain group;

* iterate through available limits;

do while (subjid > _subjid or (subjid = _subjid and (activity > sleep and sleep ne .)));

group=_group; * set default group (previous);

set test1 (rename=(subjid=_subjid group=_group));

end;

if (awake <= activity <= max(sleep,activity)) then group=_group; * set correct group;

run;

Here's the result running michtka's last sample, handling correctly values that are outside ranges.

Obs subjid activity group

1 1 15 1

2 1 25 1

3 1 35 2

4 1 49 2

5 1 58 5

6 2 71 1

7 2 86 2

8 2 90 3

9 2 104 3

10 2 115 2

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

Haikuo · Posted 09-05-2012 08:36 PM

Hi Daniel,

Slick approach. While it is worth pointing out that besides presorting, it also requires test1 only containing those subjid that existing in test2. Otherwise, instead of issuing a missing group value, it will assign a non-missing value, which does not make sense in this context.

Regards,

Haikuo

DanielSantos · Posted 09-06-2012 05:52 AM

Hi Haikuo. I wouldn't call it slick , I think it's actually pretty logical and gives a good example on datastep iterations control.

Also, good point there with the non match subjid, but that's an easy fix:

data want1 (drop=sleep awake _:) ; * drop limits and temp vars;

set test2_s ;

retain group;

do while (not _EOF and

(subjid > _subjid or (subjid = _subjid and (activity > sleep and sleep ne .))));

group=_group; * set default group (previous);

set test1 (rename=(subjid=_subjid group=_group)) end=_EOF;

end;

if subjid ne _subjid then group=.; * set missing for no match;

else if (awake <= activity <= max(sleep,activity)) then group=_group; * set correct group;

run;

Of course as I said, both dataset are required to be sorted, but I believe you're approach also needs test1 to be previously ordered.

With unsorted data both approaches would require three complete read/write dataset operations (sort(t1)+sort(t2)+step and sort(t1)+step+sql).

But I've run some rough tests for 400 random subjid/activity values (SAS 9.1.3 WIN), and sort(t2)[avg=0.01sec]+step[avg=0.01sec] seems to be taking less than half the time than step[avg=0.01sec]+sql[avg=0.05sec]. I really prefer step over sql (specially with large volumes of data), as you have more control over the former.

If by any means both tables happen to be previously sorted as needed, then I believe you can't go faster than the single datastep approach. Of course all of this is could be meaningless or meaningful, it really depends on how much data you have to process and machine time you're willing to spend.

But hey, very interesting discussion, always nice to see other points of view, and help others.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

michtka · Posted 09-06-2012 06:30 AM

Hi Daniel, sorry but your code doesn't work in my real problem.

It works in the example of my post, but not in the real one.

Thanks. V

DanielSantos · Posted 09-06-2012 07:17 AM

That's weird michtka, it should be fine with anything that follows your example's data.

Are both tables ordered like they should?

Well, anyway, never mind.

Cheers from Portugal.

Daniel Santos @ www.cgd.pt

michtka · Posted 09-06-2012 08:17 AM

Hi Daniel, i will try to this weekend to understand in deep your code, and the Hai.Kuo code, and I will give you feedback.

Thnaks.

michtka · Posted 09-09-2012 02:03 PM

about your reply number 4.

I think is better rename nobs by other name as for example last...nobs many times sounds confusing if you are learning code.

data want (keep=subjid activity group);

set test2 (rename=subjid=_subjid);

do n=1 to nobs ;

set test1 point=n nobs=last;

if subjid=_subjid and awake<=activity<=sleep then do; output; leave;end;

else if n=last then do; call missing(group); output;end;

end;

run;

michtka · Posted 09-09-2012 02:55 PM

* This is the final code not using sql

data test_1;

set test1;

by subjid;

set test1(firstobs=2 rename=awake=_awake keep=awake) test1(drop=_all_ obs=1);

/* _awake=ifn(last.subjid,.,_awake);*/ *doesnt work in SAS 9.0;

*alternative to the ifn function;

if last.subjid then _awake=.;

else _awake=_awake; *we can rid of that for this particle example;

run;

options missing=' ';

data want (keep=subjid activity group);

set test2 (rename=subjid=_subjid);

do n=1 to last ;

set test_1 point=n nobs=last;

/*put subjid= activity=;*/ *pri nt in the log the values of subjid and activity in every iteration;

if subjid=_subjid and ( awake<= activity<= input(scan(catx(' ',_awake,sleep,activity),1),best8.) ) then do; output; leave;end;

else if n=last then do;call missing(group); output;end;

end;

run;

Astounding · Posted 09-04-2012 03:26 PM

As long as there's still activity here, nobody mentioned formats as a solution. You can always permanently save the format and apply it as needed. For example, create the format using:

data create_format;

set test1 end=done;

retain fmtname '$group';

start = put(subjid,z2.) || put(awake,z2.);

if asleep=. then hlo='H';

else end = put(subjid,z2.) || put(asleep,z2.);

label=put(group,z2.);

output;

if done;

hlo='O';

label='NA';

output;

run;

proc format cntlin=create_format;

run;

From that point, it should be easy to use the format. For example:

data test2;

set test2;

test_string = put( put(subjid,z2.) || put(activity,z2.), $group. );

if test_string ne 'NA' then group = input(test_string,2.);

drop test_string;

run;

Just for the record, all of this is untested code, and adjustments to Z2 could easily be needed to reflect the width needed for actual data values.

Now that I posted this, I realize it still needs some work. The use of HLO=H cannot be applied to more than one observation. Perhaps to be continued, depending on when I have time and inclination!

grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Re: grouping records

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!