Solved: Re: SAS selecting rows from historical dataset

sivakoya · Posted 02-07-2017 04:00 PM

Hi,

I have following dataset.

data test;

input id date yymmdd8. X Y;

format date date9.;

datalines;

1 20160131 1 1

1 20160229 1 2

1 20160331 2 3

1 20160430 3 4

1 20160531 3 5

1 20160630 3 6

2 20160131 1 1

2 20160229 1 2

2 20160331 2 3

2 20160430 2 4

2 20160531 3 5

2 20160630 3 6

3 20160131 1 1

3 20160229 4 2

3 20160331 9 3

3 20160430 9 4

;

run;

id	date	X	Y
1	31Jan2016	1	1
1	29Feb2016	1	2
1	31Mar2016	2	3
1	30Apr2016	3	4
1	31May2016	3	5
1	30Jun2016	3	6
2	31Jan2016	1	1
2	29Feb2016	1	2
2	31Mar2016	2	3
2	30Apr2016	2	4
2	31May2016	3	5
2	30Jun2016	3	6
3	31Jan2016	1	1
3	29Feb2016	4	2
3	31Mar2016	9	3
3	30Apr2016	9	4

From this I need to select the record for each ID, at the instance when (X > 2) for the first time.

So from the above dataset my output should be:

id	date	X	Y
1	30Apr2016	3	4
2	31May2016	3	5
3	29Feb2016	4	2

I tried doing:

data test;

input id date yymmdd8. X Y;

format date date9.;

datalines;

1 20160131 1 1

1 20160229 1 2

1 20160331 2 3

1 20160430 3 4

1 20160531 3 5

1 20160630 3 6

2 20160131 1 1

2 20160229 1 2

2 20160331 2 3

2 20160430 2 4

2 20160531 3 5

2 20160630 3 6

3 20160131 1 1

3 20160229 4 2

3 20160331 9 3

3 20160430 9 4

;

run;

proc sort data=test out=test1;

by id descending x date;

run;

data test2;

set test1;

by id;

if first.id then output;

run;

But this would give me:

id	date	X	Y
1	30Apr2016	3	4
2	31May2016	3	5
3	31Mar2016	9	3

Wrong entry for Id=3.

Any easy way to do this on a dataset with 60million records.

Thanks

Astounding · Posted 02-07-2017 04:25 PM

I'm sure you will get to choose from a few possibilities. Here's one that assumes your data set is sorted by ID DATE:

data want;

need_a_record='Y';

do until (last.id);

set have;

by id;

if need_a_record='Y' and X > 2 then do;

output;

need_a_record='N';

end;

if need_a_record='Y' then output;

drop need_a_record;

run;

*** Statement added to handle the change in requirements.

View solution in original post

Astounding · Posted 02-07-2017 04:25 PM

I'm sure you will get to choose from a few possibilities. Here's one that assumes your data set is sorted by ID DATE:

data want;

need_a_record='Y';

do until (last.id);

set have;

by id;

if need_a_record='Y' and X > 2 then do;

output;

need_a_record='N';

end;

if need_a_record='Y' then output;

drop need_a_record;

run;

*** Statement added to handle the change in requirements.

sivakoya · Posted 02-07-2017 05:54 PM

@Astounding @ballardwthanks

ballardw · Posted 02-07-2017 04:34 PM

If not sorted:

proc sort data=test ;
   by id date;
run;

data want;
   set subset (where=(x >2));
   by id date;
   if first.id;
run;

If sorted by id and date then the datastep should suffice assuming you do not want any id if there is never a value greater than 2.

If you do want something when there are not any values for x > 2 then we need a rule.

sivakoya · Posted 02-07-2017 04:48 PM

@ballardw @Astounding Sorry as i think throught it, i shouldnt drops the ones that never went x>2. So for the ones which has X < 2, i would need to retain the most latest record. In the data below ID-4 has X=1 throughout, so I need to retain the last record for 4, which is 4 20160430 1 4.

data test;

input id date yymmdd8. X Y;

format date date9.;

datalines;

1 20160131 1 1

1 20160229 1 2

1 20160331 2 3

1 20160430 3 4

1 20160531 3 5

1 20160630 3 6

2 20160131 1 1

2 20160229 1 2

2 20160331 2 3

2 20160430 2 4

2 20160531 3 5

2 20160630 3 6

3 20160131 1 1

3 20160229 4 2

3 20160331 9 3

3 20160430 9 4

4 20160131 1 1

4 20160229 1 2

4 20160331 1 3

4 20160430 1 4

5 20160131 1 1

5 20160229 1 2

5 20160331 2 3

5 20160430 2 4

;

run;

thanks

ballardw · Posted 02-07-2017 07:19 PM

Another approach:

proc sort data=test;
   by id date;
run;
data want;
   set test;
   by id date;
   retain found;
   if first.id then  found=0;
   if found=0 and x>2 then do;
      output;
      found=1;
   end;
   if last.id and found=0 then output;
   drop found;
run;

Please post code in code boxes opened with the {i} icon. Sometimes we get artifacts from pasting in the main window as your code gets transformed to html in the main box of the message. For instance copy some of your code from the forum and paste back into your editor. See any difference?

sivakoya · Posted 02-08-2017 03:11 PM

data have;
input id date yymmdd8. X Y;
format date date9.;
datalines;
1 20160131 1 1
1 20160229 1 2
1 20160331 2 3
1 20160430 3 4
1 20160531 3 5
1 20160630 3 6
2 20160131 1 1
2 20160229 1 2
2 20160331 2 3
2 20160430 2 4
2 20160531 3 5
2 20160630 3 6
3 20160131 1 1
3 20160229 4 2
3 20160331 9 3
3 20160430 9 4
;
run;



data want;
set have;
prev_x = lag(X);
if prev_x > 2 and x > 2 then bad_now = 'N';
else do; if x>2 then bad_now='Y'; else bad_now='N'; end;
run;

data final;
set want;
where bad_now='Y';
run;

@ballardwthanks. I have thought of another approach 🙂

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away