Re: Selecting records.

deleted_user · Posted 11-21-2010 05:39 AM

My database contains several identities(ID). For each product(PROD) and ID, I should find the first occurrence where the outcome(OUTC) is Y. That record and all records for that PROD and ID, existing earlier, with regard to DATE2, should be selected.

Example for one, and only one, identity:

ID***DATE1******DATE2******PROD**OUTC
17**20041016**20041107***1010******N
17**20041016**20050131***1010******Y
17**20050718**20050919***1010******N
17**20050718**20060221***1010******Y
17**20070206**20070314***1227******N
17**20070206**20071021***1227******N
17**20080714**20081027***1624******Y

The selection for ID=17 should become:

ID***DATE1******DATE2******PROD**OUTC
17**20041016**20041107***1010******N
17**20041016**20050131***1010******Y
17**20080714**20081027***1624******Y

What is the best (fastest and/or easiest) solution?

Peter_C · Posted 11-21-2010 12:31 PM

the complexity of your join, (items prior to the target date on Y records) suggests an sql solution[pre]proc sql ;
create table selected_records
select a.*
from your.database a
join your.database b
on a.id = b.id
and a.prod = b.prod
where b.outC EQ 'Y'
and a.date2 LE b.date2
;
quit ;[/pre]

deleted_user · Posted 11-21-2010 01:19 PM

I suppose a and b are so called aliases.

But isn't the keyword AS missing somewhere?

And JOIN, is that really a keyword on it's own?

sbb · Posted 11-21-2010 02:39 PM

The DOC is quite useful and informative.

Scott Barry
SBBWorks, Inc.

SAS SQL Procedure User's Guide: Introduction to the SQL Procedure
http://support.sas.com/documentation/cdl/en/sqlproc/62086/HTML/default/a002536894.htm

deleted_user · Posted 11-21-2010 04:07 PM

If I put AS after CREATE TABLE SELECTED_RECORDS and before SELECT A.*, then the program at least runs.

But the created table SELECTED_RECORDS still contains 7 records. There should be 3 records.

ArtC · Posted 11-21-2010 06:13 PM

If you can get the SQL to work it will probably be for efficient than the following, which at least seems to work. It does read portions of the data twice and requires a sort, so may be less practical for large data sets.
[pre]data idprod;
input id DATE1 :YYMMdd. DATE2 :YYMMdd. PROD OUTC $;
format date1 date2 date9.;
datalines;
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
RUN;

proc sort data=idprod;
by id prod date2;
run;

data select(keep=id date1 date2 prod outc);
set idprod;
by id prod;
retain point start .
found ' ';
cnt+1;
if first.prod then do;
start=cnt;
found=' ';
end;
if outc='Y' & found=' ' then do point=start to cnt;
set idprod point=point;
output select;
found='x';
end;
run;
[/pre]

Ksharp · Posted 11-22-2010 02:58 AM

Hi.
I think your logic is too complex.and data-step conjoint with proc sql is perfect to process data.

[pre]
data idprod;
input id DATE1 :YYMMdd10. DATE2 :YYMMdd10. PROD OUTC $;
format date1 date2 yymmddn8.;
datalines;
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
RUN;

proc sort data=idprod;
by id prod date2;
run;

data index(where=(count eq 1 and outc eq 'Y'));
set idprod;
by id prod;
if first.prod then count=0;
if outc='Y' then count+1; * to identify the first occurrence of 'Y';
proc sql feedback;
create table selected_records as
select b.*
from index as a left join idprod as b
on a.id = b.id and a.prod = b.prod and b.date2 le a.date2;
quit;
proc print noobs;run;
[/pre]

Ksharp

deleted_user · Posted 11-22-2010 05:53 AM

Thank's to ArtC and Ksharp. Your solutions worked.

To ArtC: Could you explain the line retain point start .
found ' ';? I know what retain does.

To Ksharp: Obviously it works to place PROC SQL inside a data step, but to me it seems a little odd. Although the PROC SQL-code is placed inside the data step it must mean that the data step is run and then the PROC SQL starts.
Any comments?

Real data consists of 21 127 records. With ArtC:s program 20 588 records are selected and with Ksharp:s program 20 611, so there must be some difference between what the programs do.

ArtC · Posted 11-22-2010 02:48 PM

In this case the RETAIN also initializes the variables and therefore also determines if they are numeric or character.

deleted_user · Posted 11-22-2010 04:07 PM

ArtC, I understand your solution now and I think it's correct.

Could anyone explain the "select b.*" in Ksharp:s solution?

A left join is done on a(alias for index) and b(alias for idprod) with the specified conditions, if I got it right.

Then I don't understand the "b" in "select b.*". Is it only a naming of the left join, where it is allowed to use "b" as a new alias?

Thank's for any help. I am trying to identify why the two solutions don't give the same result on real data.

Ksharp · Posted 11-22-2010 09:24 PM

Haha.
I think I found the answer.See the data below.
[pre]
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20041016 20050131 1010 N
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
[/pre]

For third record,which has the same id prod with the last observation,but they have different value of outc, one is Y ,another is N.
So run the Art's code ,it also get three obs,But my code is four obs.
So we need to promote the proc sort.
[pre]
proc sort data=idprod;
by id prod date2 outc;
run;
[/pre]

Ksharp

deleted_user · Posted 11-25-2010 03:14 AM

Ksharp, the new sorting condition is needed in ArtC:s program to handle records like the new one you added.
In your program the new sorting is not needed, since the condition "where b.date2 le a.date2;" takes care of that.

I ran ArtC:s program on the real data with the new sorting, but the result didn't change, so I'll have to look for another reason for the different results the two programs give.

The explanation why your program selects more records is probably: If there are multiple records for a ID*PROD-combination where OUTC='Y', then all those records are selected by your program, but not with ArtC:s program.

What I want is the result of ArtC:s program.

I will delete the possible records mentioned, from the result of your program to check if the number of records then correspond.

Ksharp · Posted 11-22-2010 09:07 PM

>Obviously it works to place PROC SQL inside a data step, but to me it seems a little odd.

Hi. That is not insert proc sql, I forget RUN; and thank SAS to intelligently identify data-step and proc sql.
'b.*' stands for select any variable from idprod,not for the new alias.
I am not sure why there is different result for Me and Art.acctually i do not understand art's code totally.
the following is the code i change a little bit.
Indeed,You can use 'merge idprod index; by...' and 'where .....' to instead of proc sql.

[pre]
data idprod;
input id DATE1 :YYMMdd10. DATE2 :YYMMdd10. PROD OUTC $;
format date1 date2 yymmddn8.;
datalines;
17 20041016 20041107 1010 N
17 20041016 20050131 1010 Y
17 20050718 20050919 1010 N
17 20050718 20060221 1010 Y
17 20070206 20070314 1227 N
17 20070206 20071021 1227 N
17 20080714 20081027 1624 Y
RUN;

proc sort data=idprod;
by id prod date2;
run;

data index(where=(count eq 1 and outc eq 'Y'));
set idprod;
by id prod;
if first.prod then count=0;
if outc='Y' then count+1; * to identify the first occurrence of 'Y';
run;
proc sql feedback;
create table selected_records as
select b.id as _id,b.prod as _prod,b.date1 as _date1,b.date2 as _date2,b.outc as _outc
from index as a left join idprod as b
on a.id = b.id and a.prod = b.prod
where b.date2 le a.date2;
quit;
proc print noobs;run;
[/pre]

Ksharp

SPR · Posted 11-22-2010 11:07 AM

Hello EmestoC,

If your data always come in pairs, I mean for ID and Prod fixed, the NN or NY of OUTC combinations are only possible then the LAG function can be used as an alternative solution:
[pre]
data a;
retain t;
set idprod;
if first.prod then t=0;
ID0=LAG(ID); d10=LAG(Date1); d20=LAG(date2); P0=LAG(PROD); O0=LAG(OUTC);
if OUTC="Y" and t=0 then do;
t=1;
output;
if not first.prod then do;
ID=ID0; date1=d10; date2=d20; prod=p0; OUTC=O0;
output;
end;
end;
by ID PROD;
keep ID date1 date2 prod OUTC;
run;
[/pre]
The solution is not very simple...
Sincerely,
SPR

SPR · Posted 11-23-2010 10:38 AM

Hello EmestoC,

Another possible solution could be PROC IML that allows a user to manipulate matrices the way like it BASIC or C do. That is to process matrix rows in any way using indices.

SPR

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away