Using PROC SQL with WHERE to select if character values are identical?

Accepted Solution Solved
Reply
Contributor SM1
Contributor
Posts: 43
Accepted Solution

Using PROC SQL with WHERE to select if character values are identical?

I'm working with repeated measures medical data. Some patients were seen by only 1 provider while other patients were seen by multiple providers. I need to separate patient records into 2 tables: patients seen by only 1 provider and patients seen by 2 or more providers.

 

The data is sorted by patient ID (PATID - character variable) and then appointment date (APPT_DATE - SAS date). Providers are identified by their unique code (PROVID) which is a mixture of letters and numbers - so character values.

 

A few lines of data might look like:

 

PATID            APPT_DATE        PROVID

12345           12/05/2015         SMITH123

12345           01/05/2016         SMITH123

45678           12/10/2015         JONES834

45678           01/15/2016         SMITH123

 

Here's the (incomplete) code I was planning to use:

 

proc sql;

    create table single_provider as select * from all_patients

    where PATID in (select PATID from all_patients where (?????????);

quit;

 

I'm trying to figure out what syntax could be used for either of these 2 conditional clauses:

 

1. select PATID only if all PROVID values are identical for that PATID

2. select PATID if there are 2 or more different PROVID values for that PATID 

 

Any suggestions will be most appreciated.

 

Thanks!

 


Accepted Solutions
Solution
‎04-28-2016 07:21 PM
Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

Hi @SM1,

 

Try this:

proc sql;
create table single_provider as
select * from all_patients
group by patid
having count(distinct provid)=1
order by patid, appt_date;

create table mult_providers as
select * from all_patients
group by patid
having count(distinct provid)>1
order by patid, appt_date;
quit;

 

View solution in original post


All Replies
Solution
‎04-28-2016 07:21 PM
Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

Hi @SM1,

 

Try this:

proc sql;
create table single_provider as
select * from all_patients
group by patid
having count(distinct provid)=1
order by patid, appt_date;

create table mult_providers as
select * from all_patients
group by patid
having count(distinct provid)>1
order by patid, appt_date;
quit;

 

Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

Alternatively, you could use a data step with double DOW loop:

data single_provider mult_providers;
do until(last.patid);
  set all_patients;
  by patid;
  if first.patid then provid1=provid;
  else if provid ne provid1 then mult=1;
end;
do until(last.patid);
  set all_patients;
  by patid;
  if mult then output mult_providers;
  else output single_provider;
end;
drop mult provid1;
run;
Contributor SM1
Contributor
Posts: 43

Re: Using PROC SQL with WHERE to select if character values are identical?

Thanks @FreelanceReinhard!

 

The PROC SQL solution works perfectly. 2,275 observations when into the new table SINGLE_PROVIDER and 1,941 observations went into the new table MULT_PROVIDERS.

 

Something is off with the double DOW loop syntax. I was thinking that in the 2nd DO loop, the conditional statement:

 

 if mult then output mult_providers;

needs to be changed to:

 

if mult = 1 then output mult_providers;

 

but the result was the same.

 

0 observations went into the table SINGLE_PROVIDER and 4,216 observations went into the table MULT_PROVIDER.

 

 

Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

How strange. I've not been able to replicate your issue. There is one known difference between the solutions, though (which I should have mentioned): PROC SQL's DISTINCT count ignores missing values. The data step approach in its current version, however, would regard a missing vs. a non-missing value of PROVID as two providers. So, the results can differ if variable PROVID has missing values in some observations.

 

But this will hardly explain the extreme result that you describe (0 obs. in SINGLE_PROVIDER).

 

With your sample data (and with an extended version of it that I used for testing purposes), however, I obtained identical results with both approaches, as long as PROVID was always non-missing.

 

The criterion "mult" should be equivalent to "mult=1" here (hence, the results after changing just this must be the same), because "mult" alone means "mult is neither 0 nor missing." In fact, variable MULT is either 1 or missing at every point in time in this data step.

 

Do you get the same incorrect result with your sample data?

 

If not, how does your real data differ from the sample data?

 

Or did you modify the data step slightly (perhaps to adapt it to the structure of your real data)? If so, please show the modifications.

Contributor SM1
Contributor
Posts: 43

Re: Using PROC SQL with WHERE to select if character values are identical?

The data is set up as shown with the sample data provided.

There is no missing data in any of the variables referenced in the syntax (i.e., patient ID, appointment date, and provider ID).

 

This is from the log after re-running the syntax:

 

NOTE: There were 4216 observations read from the data set WORK.IRREGULAR_1GAP_CM.
NOTE: There were 4216 observations read from the data set WORK.IRREGULAR_1GAP_CM.
NOTE: The data set WORK.S_PROVIDER has 0 observations and 893 variables.
NOTE: The data set WORK.M_PROVIDERS has 4216 observations and 893 variables.

 

 

I did slightly change the syntax.

 

The name of the SET table is irregular_1gap_cm (instead of all_patients).

The provider ID variable's name is PROVID1 (so I used PROVID2 as the new variable).

I changed the name of the 2 newly generated tables (since I used the other names with the PROC SQL syntax) to s_provider and m_providers.

 

This is the syntax I ran:

 

data s_provider m_providers;
do until(last.patid);
  set irregular_1gap_cm;
  by patid;
  if first.patid then provid2=provid1;
  else if provid1 ne provid2 then mult=1;
end;
do until(last.patid);
  set irregular_1gap_cm;
  by patid;
  if mult then output m_providers;
  else output s_provider;
end;
drop mult provid2;
run;

Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

Thanks for providing the details. I've just run your code on simulated data comprising 2,001,129 observations. Result:

NOTE: There were 2001129 observations read from the data set WORK.IRREGULAR_1GAP_CM.
NOTE: There were 2001129 observations read from the data set WORK.IRREGULAR_1GAP_CM.
NOTE: The data set WORK.S_PROVIDER has 684077 observations and 3 variables.
NOTE: The data set WORK.M_PROVIDERS has 1317052 observations and 3 variables.

And the result is the same as from PROC SQL.

 

There does not happen to be a variable MULT among the 893 variables in your dataset IRREGULAR_1GAP_CM, or?

 

If not, I would select one PATID from the "S_PROVIDER" dataset produced by PROC SQL and apply the data step to just the observations of this PATID.

Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

An existing variable PROVID2 in dataset IRREGULAR_1GAP_CM could explain the issue as well.

Contributor SM1
Contributor
Posts: 43

Re: Using PROC SQL with WHERE to select if character values are identical?

@FreelanceReinhard - thanks so much for all of your help!

 

With some 843 variables, I should have first checked to make sure there was not already a PROVID2 variable.

 

There is.

 

I'm so sorry - and thank you so much for all of your help.

 

Both solutions work perfectly!

 

And I now better understand the DOW syntax - which will really help in the next few weeks.

 

Thank you so much!

Trusted Advisor
Posts: 1,116

Re: Using PROC SQL with WHERE to select if character values are identical?

This is good news! Thank you very much.

 

So, my lesson learned is: Avoid name conflicts with unknown datasets by using "rare" variable names (e.g. with leading or trailing underscores and the like).

 

Now I'll be able to sleep (here in Central Europe). :-)

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 9 replies
  • 588 views
  • 1 like
  • 2 in conversation