BookmarkSubscribeRSS Feed
deleted_user
Not applicable
My database contains several identities(ID). For each product(PROD) and ID, I should find the first occurrence where the outcome(OUTC) is Y. That record and all records for that PROD and ID, existing earlier, with regard to DATE2, should be selected.

Example for one, and only one, identity:

ID***DATE1******DATE2******PROD**OUTC
17**20041016**20041107***1010******N
17**20041016**20050131***1010******Y
17**20050718**20050919***1010******N
17**20050718**20060221***1010******Y
17**20070206**20070314***1227******N
17**20070206**20071021***1227******N
17**20080714**20081027***1624******Y

The selection for ID=17 should become:

ID***DATE1******DATE2******PROD**OUTC
17**20041016**20041107***1010******N
17**20041016**20050131***1010******Y
17**20080714**20081027***1624******Y

What is the best (fastest and/or easiest) solution?
14 REPLIES 14
Peter_C
Rhodochrosite | Level 12
the complexity of your join, (items prior to the target date on Y records) suggests an sql solution[pre]proc sql ;
create table selected_records
select a.*
from your.database a
join your.database b
on a.id = b.id
and a.prod = b.prod
where b.outC EQ 'Y'
and a.date2 LE b.date2
;
quit ;[/pre]
deleted_user
Not applicable
I suppose a and b are so called aliases.

But isn't the keyword AS missing somewhere?

And JOIN, is that really a keyword on it's own?
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
The DOC is quite useful and informative.

Scott Barry
SBBWorks, Inc.

SAS SQL Procedure User's Guide: Introduction to the SQL Procedure
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/a002536894.htm
deleted_user
Not applicable
If I put AS after CREATE TABLE SELECTED_RECORDS and before SELECT A.*, then the program at least runs.

But the created table SELECTED_RECORDS still contains 7 records. There should be 3 records.
ArtC
Rhodochrosite | Level 12
If you can get the SQL to work it will probably be for efficient than the following, which at least seems to work. It does read portions of the data twice and requires a sort, so may be less practical for large data sets.
[pre]data idprod;
input id DATE1 :YYMMdd. DATE2 :YYMMdd. PROD OUTC $;
format date1 date2 date9.;
datalines;
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
RUN;

proc sort data=idprod;
by id prod date2;
run;

data select(keep=id date1 date2 prod outc);
set idprod;
by id prod;
retain point start .
found ' ';
cnt+1;
if first.prod then do;
start=cnt;
found=' ';
end;
if outc='Y' & found=' ' then do point=start to cnt;
set idprod point=point;
output select;
found='x';
end;
run;
[/pre]
Ksharp
Super User
Hi.
I think your logic is too complex.and data-step conjoint with proc sql is perfect to process data.

[pre]
data idprod;
input id DATE1 :YYMMdd10. DATE2 :YYMMdd10. PROD OUTC $;
format date1 date2 yymmddn8.;
datalines;
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
RUN;

proc sort data=idprod;
by id prod date2;
run;

data index(where=(count eq 1 and outc eq 'Y'));
set idprod;
by id prod;
if first.prod then count=0;
if outc='Y' then count+1; * to identify the first occurrence of 'Y';
proc sql feedback;
create table selected_records as
select b.*
from index as a left join idprod as b
on a.id = b.id and a.prod = b.prod and b.date2 le a.date2;
quit;
proc print noobs;run;
[/pre]



Ksharp
deleted_user
Not applicable
Thank's to ArtC and Ksharp. Your solutions worked.

To ArtC: Could you explain the line retain point start .
found ' ';
? I know what retain does.

To Ksharp: Obviously it works to place PROC SQL inside a data step, but to me it seems a little odd. Although the PROC SQL-code is placed inside the data step it must mean that the data step is run and then the PROC SQL starts.
Any comments?

Real data consists of 21 127 records. With ArtC:s program 20 588 records are selected and with Ksharp:s program 20 611, so there must be some difference between what the programs do.
ArtC
Rhodochrosite | Level 12
In this case the RETAIN also initializes the variables and therefore also determines if they are numeric or character.
deleted_user
Not applicable
ArtC, I understand your solution now and I think it's correct.

Could anyone explain the "select b.*" in Ksharp:s solution?

A left join is done on a(alias for index) and b(alias for idprod) with the specified conditions, if I got it right.

Then I don't understand the "b" in "select b.*". Is it only a naming of the left join, where it is allowed to use "b" as a new alias?

Thank's for any help. I am trying to identify why the two solutions don't give the same result on real data.
Ksharp
Super User
Haha.
I think I found the answer.See the data below.
[pre]
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20041016 20050131 1010 N
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
[/pre]

For third record,which has the same id prod with the last observation,but they have different value of outc, one is Y ,another is N.
So run the Art's code ,it also get three obs,But my code is four obs.
So we need to promote the proc sort.
[pre]
proc sort data=idprod;
by id prod date2 outc;
run;
[/pre]




Ksharp
deleted_user
Not applicable
Ksharp, the new sorting condition is needed in ArtC:s program to handle records like the new one you added.
In your program the new sorting is not needed, since the condition "where b.date2 le a.date2;" takes care of that.

I ran ArtC:s program on the real data with the new sorting, but the result didn't change, so I'll have to look for another reason for the different results the two programs give.

The explanation why your program selects more records is probably: If there are multiple records for a ID*PROD-combination where OUTC='Y', then all those records are selected by your program, but not with ArtC:s program.

What I want is the result of ArtC:s program.

I will delete the possible records mentioned, from the result of your program to check if the number of records then correspond.
Ksharp
Super User
>Obviously it works to place PROC SQL inside a data step, but to me it seems a little odd.

Hi. That is not insert proc sql, I forget RUN; and thank SAS to intelligently identify data-step and proc sql.
'b.*' stands for select any variable from idprod,not for the new alias.
I am not sure why there is different result for Me and Art.acctually i do not understand art's code totally.
the following is the code i change a little bit.
Indeed,You can use 'merge idprod index; by...' and 'where .....' to instead of proc sql.

[pre]
data idprod;
input id DATE1 :YYMMdd10. DATE2 :YYMMdd10. PROD OUTC $;
format date1 date2 yymmddn8.;
datalines;
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
RUN;

proc sort data=idprod;
by id prod date2;
run;

data index(where=(count eq 1 and outc eq 'Y'));
set idprod;
by id prod;
if first.prod then count=0;
if outc='Y' then count+1; * to identify the first occurrence of 'Y';
run;
proc sql feedback;
create table selected_records as
select b.id as _id,b.prod as _prod,b.date1 as _date1,b.date2 as _date2,b.outc as _outc
from index as a left join idprod as b
on a.id = b.id and a.prod = b.prod
where b.date2 le a.date2;
quit;
proc print noobs;run;
[/pre]


Ksharp
SPR
Quartz | Level 8 SPR
Quartz | Level 8
Hello EmestoC,

If your data always come in pairs, I mean for ID and Prod fixed, the NN or NY of OUTC combinations are only possible then the LAG function can be used as an alternative solution:
[pre]
data a;
retain t;
set idprod;
if first.prod then t=0;
ID0=LAG(ID); d10=LAG(Date1); d20=LAG(date2); P0=LAG(PROD); O0=LAG(OUTC);
if OUTC="Y" and t=0 then do;
t=1;
output;
if not first.prod then do;
ID=ID0; date1=d10; date2=d20; prod=p0; OUTC=O0;
output;
end;
end;
by ID PROD;
keep ID date1 date2 prod OUTC;
run;
[/pre]
The solution is not very simple...
Sincerely,
SPR
SPR
Quartz | Level 8 SPR
Quartz | Level 8
Hello EmestoC,

Another possible solution could be PROC IML that allows a user to manipulate matrices the way like it BASIC or C do. That is to process matrix rows in any way using indices.

SPR

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 1629 views
  • 0 likes
  • 6 in conversation