Solved: Re: By Variables Not Properly Sorted

JediApprentice · Posted 10-26-2016 02:52 PM

Here is my code:

proc sort data = work.Calculated_Data2;
  by AffiliationID ApprovedPriceType BRAND_GENERIC CarrierID ClaimOriginationFlag 
     ClaimStatus CLIENT CostTypeCode DISPENSEASWRITTENDAWPRODUCTSELE 
     ExtendedSupplyNetworkIndicator FilledMonth FilledQuarter FinalPlanCode GPINumber 
     MailOrderIndicator MultiSourceCode PlanDrugStatus PriceScheduleDefinition RxNetworkID 
     SERVICEPROVIDERID SuperPharmacyNetworkID PHARMACYZIPCODE Specialty_Flag EXCLUSIONS 
     IVLIndicator;
run; 

data Calculated_Data3;
  format Discount_HCSC Discount_AL Discount_FL Discount_Medd
         Discount_MN Discount_NC Discount_Prime Discount_Horizon Percent8.4;
  length Client_Specialty_Flag $30;
  if _N_ = 1 then do;
    declare Hash S (dataset:"egtask.SPECIALTY_LIST");
	  S.definekey("ProductServiceID");
	  S.definedata("Discount_HCSC");
	  S.definedata("Discount_AL");
	  S.definedata("Discount_FL");
	  S.definedata("Discount_MEDD");
	  S.definedata("Discount_MN");
	  S.definedata("Discount_NC");
	  S.definedata("Discount_Prime");
	  S.definedata("Discount_Horizon");
	  S.definedone();
	end;
  set Calculated_Data2;
  if S.find() = 0 then do;
    Discount_HCSC=Discount_HCSC;
    Discount_AL=Discount_AL;
    Discount_FL=Discount_FL;
    Discount_Horizon=Discount_Horizon;
    Discount_Prime=Discount_Prime;
    Discount_MN=Discount_MN;
    Discount_NC=Discount_NC;
    Discount_MEDD=Discount_MEDD;
  end;
  if CLIENT in("IL","NM","OK","TX","MT","NM") and Discount_HCSC ne . then Client_Specialty_Flag = "Specialty";
  else if CLIENT in("KS","ND","NE","WY") and Discount_Prime ne . then Client_Specialty_Flag = "Specialty";
  else if CLIENT = "MN" and Discount_MN ne . then Client_Specialty_Flag = "Specialty";
  else if CLIENT = "AL" and Discount_AL ne . then Client_Specialty_Flag = "Specialty";
  else if CLIENT = "FL" and Discount_FL ne . then Client_Specialty_Flag = "Specialty";
  else if CLIENT = "NJ" and Discount_Horizon ne . then Client_Specialty_Flag = "Specialty";
  else if CLIENT = "NC" and Discount_NC ne . then Client_Specialty_Flag = "Specialty";
  else Client_Specialty_Flag = "Non-Specialty";
run;

data EGTASK.Data_Put_In_Access;
  set work.Calculated_Data3;
  format AffiliationID $6. ApprovedPriceType $10. CarrierID $9. ClaimOriginationFlag $1.
		 ClaimStatus $1. CostTypeCode $10. DISPENSEASWRITTENDAWPRODUCTSELE $1.
		 ExtendedSupplyNetworkIndicator $1. FilledMonth $7. FilledQuarter $6.
		 GPINumber $14. MailOrderIndicator $1. MultiSourceCode $1. PlanDrugStatus $1.
		 PriceScheduleDefinition $10. RxNetworkID $6. RxNetworkID_CHAIN $6.
         SERVICEPROVIDERID $15. SuperPharmacyNetworkID $6. PHARMACYZIPCODE $10.
		 SUM_of_ApprovedIngredientCost DOLLAR16.2 SUM_of_ApprDispFeeAmount DOLLAR16.2 SUM_of_QUANTITYDISPENSED 13.3 ZIP $10.;
  RxNetworkID_CHAIN = RxNetworkID;
  SUM_of_ApprovedIngredientCost = sum(ApprovedIngredientCost);
  SUM_of_ApprDispFeeAmount = sum(ApprovedDispensingFeeAmount);
  SUM_of_AWP = sum(AWP);
  SUM_of_RX = sum(RX);
  SUM_of_QUANTITYDISPENSED = sum(QUANTITYDISPENSED);
  SUM_of_MEMBERID = sum(MemberID);
  ZIP = PHARMACYZIPCODE;
  by AffiliationID ApprovedPriceType BRAND_GENERIC CarrierID ClaimOriginationFlag 
     ClaimStatus CLIENT CostTypeCode DISPENSEASWRITTENDAWPRODUCTSELE 
     ExtendedSupplyNetworkIndicator FilledMonth FilledQuarter FinalPlanCode GPINumber 
     MailOrderIndicator MultiSourceCode PlanDrugStatus PriceScheduleDefinition RxNetworkID 
     SERVICEPROVIDERID SuperPharmacyNetworkID PHARMACYZIPCODE Specialty_Flag Client_Specialty_Flag EXCLUSIONS 
     IVLIndicator;
run;

The error occurs when I get to the last datastep (in the by). It give me the error: "Variables are not properly sorted in dataset WORK.Calculated_Data3". The weird thing is, earlier on in the project, Calculated_Data2 is created with a subsetting condition based on a range of dates. When I run it for one month, it runs through this code and works with no errors. But when I run it for 4 months, it throws the error. What could be going on here?

Shmuel · Posted 10-27-2016 09:55 PM

@JediApprentice have an example of each month is like:

1 non_speicalty

2 non_specialty

3 specialty

4 specialty

so every month by itself is sorted ascending ok.

now take two month, then you will get a non sorted data:

1 non_speicalty

2 non_specialty

3 specialty

4 specialty

1 non_speicalty <<<<< breaking point of ascending sort

2 non_specialty

3 specialty

4 specialty

View solution in original post

Shmuel · Posted 10-26-2016 03:13 PM

Depending on SAS instalation and SAS system options, in case of multi-threading (parallel computing)

output of a datastep (your data2) may be not sorted. In such case you need to sort it again.

As much as I know there is an option to keep output sorted. Need search for appropriate option.

Astounding · Posted 10-26-2016 03:21 PM

Your final BY statement includes CLIENT_SPECIALTY_FLAG. But that variable was assigned a value in the DATA step. Why would you expect the data to be in order by this newly created variable?

JediApprentice · Posted 10-26-2016 03:27 PM

@Astounding Right, but when I create Calculated_Data2 in a previous step in the project, I subset by a range of dates. When I run that with a date period of one month, everything sorts fine and produces correct results with no errors, but when I run it for multiple months, I suddenly get sort error.

Astounding · Posted 10-26-2016 04:09 PM

It's got to be data-related. Unfortunately, I can't really examine your data. But something along these lines: perhaps for a single month, CLIENT_SPECIALTY_FLAG is always "Non-Specialty".

Shmuel · Posted 10-26-2016 04:12 PM

you entered the Client_Specialty_Flag on BY statement before IVLIndicator (the minor BY variables);

In case of changing flag value the order is spoiled.

JediApprentice · Posted 10-26-2016 05:05 PM

@Shmuel Could you explain further I'm not sure I understand...

Shmuel · Posted 10-26-2016 07:12 PM

Let's say the by statment was: BY X Z

where X Z are:

1 1

1 2

You add a new variable Y and change to BY X Y Z

so data is now:

X Y Z

1 B 1

1 A 2

then data is not sorted by X Y Z.

JediApprentice · Posted 10-27-2016 11:36 AM

@Shmuel Okay, I see what's you're saying. But it should have failed then for the smaller dataset. I should have gotten the same error... It can't be just a matter of size. Am I right? As @Astounding suggested, it may be because Client_Specialty_Flag has only one value in the smaller data set. But I looked and it has two values: "Non-Specialty" and "Specialty". Also, I think I may be misunderstanding what the by statement in the data step is supposed to do. I'm under the impression that it simply orders by those variables. Is this correct?

Astounding · Posted 10-27-2016 12:52 PM

You have much to learn, young Jedi.

The BY statement does not order the observations. Rather, it is permitted when the observations are already in order. Its purpose is to set up flags that inform you when a group of observations (same values for all of the BY variables) begins or ends.

The are plenty of details. What are the names of these flags? What values are they assigned? How can you do this if your data is grouped, but not sorted? It's a very important topic, worth the time to study.

Shmuel · Posted 10-27-2016 01:03 PM

When you run PROC SORT data=...; BY <list of keys>;
you define how to sort/order the data;

But, when you use Data out; Set input; BY <list of keys>; ... ( or other proc with BY )

you declare that the program may assume that data is already sorted by those keys,

in the order of variables given in the list.

You need data to be sorted in order to calculate data per group.

The group is defined by the whole list (e.g. BY a b c d ) or by part of it (like: BY a b only).

When you deal with small dataset, probably the keys were in right order;

As you increase in data the probability to "break" the order is raising.

You can use OPTIONS OBS= n; to deal with different number of observations

(e.g: try 1000, if ok try again by 2000 else try by 500, etc. and then look at the dataset keys to identify where

the order is broken);

in yout specific case "Non-Specialty" should preceed "Specialty" as "N" comes before "S" in alphabetic order.

if you find any "Non-Specialty" comming after "Specialty", this is a broken order point.

I hope I clarified the double meaning of BY statemnet. (I know that my english is not the best);

JediApprentice · Posted 10-27-2016 02:22 PM

@Shmuel Okay I think I for the most part understand by group processing now. What I still don't understand is when you said

"When you deal with small dataset, probably the keys were in right order; As you increase in data the probability to "break" the order is raising." I'm confused as to why a greater number of observations may cause the order to break. I don't see why the keys would be in a different order just because you added more data to the dataset.

ballardw · Posted 10-27-2016 03:09 PM

@JediApprentice wrote:

@Shmuel Okay I think I for the most part understand by group processing now. What I still don't understand is when you said

"When you deal with small dataset, probably the keys were in right order; As you increase in data the probability to "break" the order is raising." I'm confused as to why a greater number of observations may cause the order to break. I don't see why the keys would be in a different order just because you added more data to the dataset.

If you have a data set that you sort by 15 different variables but do not have very many combinations it may be that the resulting sort order is effectively the same

Sort by a b c for:

a b c

1 2 3

1 2 4

1 2 5

is the same result as sort by a c.

but if your next data set looks like

a b c

1 2 3

1 1 4

1 2 4

1 0 5

then the result of sorting by a b c differs from sorting by a c. So if you sort by a c and use by a b c you'll get the out of order error.

HINT: Either copy the by statement from the sort to the by group processing procedure or vise versa.

SuryaKiran · Posted 10-27-2016 04:09 PM

I recommend to sort the data before you run the last dataset. You mentioned that the code ran sucessfully for 1 month. I hope the data in that month is in the sort order.

In your case I assume when you run for 1month all observations has "Client_Specialty_Flag" as "Specialty" . When you run for 4 months maybe it may get "Non-Specialty" that appears in unsorted order.

For example :

one dataset has all specialty and two dataset has mixed. then if we use "by" with set to unsorted data creates error. dataset one1 runs without errors and dataset two2 has errors

data one;

input num type $15. ;

datalines;

1 specialty

2 specialty

3 specialty

4 specialty

;

run;

data two;

input num type $15. ;

datalines;

1 specialty

2 specialty

3 non-specialty

4 specialty

;

run;

data one1;

set one;

by type;

run;

data two2;

set two;

by type;

run;

Thanks,
Suryakiran

Shmuel · Posted 10-27-2016 09:55 PM

@JediApprentice have an example of each month is like:

1 non_speicalty

2 non_specialty

3 specialty

4 specialty

so every month by itself is sorted ascending ok.

now take two month, then you will get a non sorted data:

1 non_speicalty

2 non_specialty

3 specialty

4 specialty

1 non_speicalty <<<<< breaking point of ascending sort

2 non_specialty

3 specialty

4 specialty

Classroom Training Available!