Re: Help with filtering data

marleeakerson · Posted 11-05-2020 03:55 PM

Hello,

Using this sample data:

ID Code

1 A

1 B

1 C

2 A

2 B

3 A

3 B

3 C

I am trying to create a new table that pulls observations where an ID has both Code 'A' and Code 'C' associated with it.

So the final data would look like:

ID Code

1 A

1 B

1 C

3 A

3 B

3 C

Thank you!

Kurt_Bremser · Posted 11-05-2020 04:19 PM

Run a double DO loop, since it is sorted:

data want;
do until (last.id);
  set have;
  by id;
  if code = 'A' then _a = 1;
  if code = 'C' then _c = 1;
end;
do until (last.id);
  set have;
  by id;
  if _a and _c then output;
end;
drop _:;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

novinosrin · Posted 11-05-2020 05:08 PM

Hi @marleeakerson If you wanna try the PROC SQL aka "Readymeals" solution, it's worth having fun.

data have;
 input ID         Code $;
 cards;
1              A

1              B

1              C

2              A

2              B
2			   A

3              A

3              B

3              C
;



proc sql;
 create table want as
 select *
 from have
 group by id
 having count(distinct ifc(code in ('A','C'),code,' '))=2
 order by id,code;
quit;

Sir @ChrisHemedinger How on earth you folks could be so genius and creative to think of IFC/IFN making it too user friendly(aka readymeals) to a fault. Kudos!!!*1E6. Just an incredible thought to envision it in the first place. If the designer would accept an ordinary bloke's appreciation, please let him/her know the privileges we enjoy.

FreelanceReinh · Posted 11-05-2020 06:16 PM

Hi @novinosrin: I think the condition

having max(code='A') & max(code='C')

(or with SUM instead of MAX) would improve performance. (At least it did on my computer with 5.5 million obs. in HAVE of which 48% were selected: It was more than three times faster, but still much slower [about factor 2] than the double DOW loop.)

novinosrin · Posted 11-05-2020 07:34 PM

Yes I agree. Performance being the criteria, DOW is indeed the best. It's just that I find some of the Proc SQL functionality very trendy and user-friendly while being seated among a group of general business users. They just enjoy fastfood and done. 🙂 Even beginners seem to grasp so quickly. I guess that accomplishes much of our credit risk queries at ease.

ChrisHemedinger · Posted 11-06-2020 07:33 AM

Love to see the different creative solutions here.

IFC (and IFN) is a SAS function, not native to SQL. Sometimes we need a SAS function to do the job -- there are so many available and SQL doesn't have analogous functions for all of them. But you might pay a performance penalty for using a SAS function in PROC SQL especially when working with database data. SAS tries to push as much work to the database as possible, but a unique SAS function will force SAS to bring at least some of the data over and increase the I/O cost.

SAS For Dummies 3rd Edition! Check out the new edition, covering SAS 9.4, SAS Viya, and all of the modern ways to use SAS!

Help with filtering data