2 weeks ago
I am trying to pull data from a very large data warehouse using SAS EG. I was told I need to pull the data a specific way in order to get the data correctly. I have to use PROC SQL.
The Problem: In this data warehouse there are patient pharmacy records. We need to find the "needle in the haystack". We need to find any and all patients that have two or more IDs. In order to find them we will have records for a patient with the same SSN, but different IDs. In addition, I'm a Base SAS Programmer and I'm using SAS EG version 7.00 HF5 (188.8.131.5256) (64 bit), so I'm not doing a project, just trying to run the program.
To test, I was only pulling one week of data, just to see if my code runs, however, since it is large, the system only wants to pull back 1M rows. So, I figured if I can only pull back those records that have two or more IDs for the same SSN, I should have a very small dataset. However, I'm failing.
Below is my code:
proc sql; create table multiple_ssns as select distinct t2.MemberProfile, t2.MemberSSN FROM SQLODBC.Claim_Fact t1 INNER JOIN SQLODBC.Patient_Dimension t2 ON (t1.PatientKey=t2.PatientKey) WHERE (t1.DateDispensedKey BETWEEN 20170501 AND 20170530) AND t1.claimstatus = 'Paid' ; quit;
Now, before you tell me this code won't work correctly, I know. This is the code that gives me more than 1M records. I tried adding a Group by and Having statement, but that didn't work as expected, either (see below).
GROUP BY t2.MemberSSN HAVING count(Distinct t1.MemberProfile) > 1
I only need a list of SSNs and IDs. I expect to see output (in a dataset) something like this with less than 100 observations (we know there aren't very many):
If you have any suggestions, I'd love to try them out. P.S. I did try to do a Proc Freq after the Proc SQL data pull, but I got an out of resources error. So, I was advised by the other programmer here that we should do all data pulls with criteria using Proc SQL.
2 weeks ago
I wouldn't expect either Group By or Having to restrict your input to 1M rows since Group By operates on the result set and Having operates on the result of aggregate functions so therefore also would not affect the input. You appear to be using ODBC but I think altering the first line of your code to:
Proc SQL inobs=1000000;
should still work. Only 1M rows should be read from the DB table
2 weeks ago
I would try something like this:
proc sql; create table multiple_ssns as select t2.MemberProfile, t2.MemberSSN, count(*) as SSN_Count FROM SQLODBC.Claim_Fact t1 INNER JOIN SQLODBC.Patient_Dimension t2 ON (t1.PatientKey=t2.PatientKey) WHERE (t1.DateDispensedKey BETWEEN 20170501 AND 20170530) AND t1.claimstatus = 'Paid' GROUP BY t2.MemberProfile, t2.MemberSSN HAVING count(*) > 1 ; quit;