Solved: Re: PROC SQL SELECT DISTINCT

JediApprentice · Posted 09-16-2016 12:05 PM

I can't post sample data because these datasets are extremely large. This is the problem I'm running into. When I run this query:

PROC SQL;
   CREATE TABLE EGTASK.DATA_WITH_MAC_GROUPING2 AS 
   SELECT t1.AffiliationID, 
          t1.ApprovedPriceType, 
          t1.BRAND_GENERIC, 
          t1.CarrierID, 
          t1.ClaimOriginationFlag, 
          t1.ClaimStatus, 
          t1.CLIENT, 
          t1.CostTypeCode, 
          t1.DISPENSEASWRITTENDAWPRODUCTSELE, 
          t1.ExtendedSupplyNetworkIndicator, 
          t1.FilledMonth, 
          t1.FilledQuarter, 
          t1.FinalPlanCode, 
          t1.GPINumber, 
          t1.MailOrderIndicator, 
          t1.MemberID, 
          t1.MultiSourceCode, 
          t1.PlanDrugStatus, 
          t1.PriceScheduleDefinition, 
          t1.RxNetworkID, 
          t1.RXNetworkID_CHAIN, 
          t1.SERVICEPROVIDERID, 
          t1.Specialty_Flag, 
          t1.SuperPharmacyNetworkID, 
          t1.ZIP, 
          t1.SUM_of_ApprovedIngredientCost, 
          t1.SUM_of_SUM_of_AIC, 
          t1.SUM_of_SUM_of_AWP, 
          t1.SUM_of_SUM_of_RX, 
          t1.SUM_of_SUM_of_QUANTITYDISPENSED, 
          t1.AFF_NAME, 
          t1.AFF_TYPE, 
          t1.NETWORK, 
          t1.'Business Line'n, 
          t1.'Analytics Grouping 1'n, 
          t1.'MAC Savings Grouping'n, 
          t1.IVLIndicator, 
          t1.Fund_Type, 
          /* GER Pharm */
            (CASE  
               WHEN t1.RXNetworkID IN('00008','00177','00608','00782','00039','00123','00207') THEN 'CVS' 
               WHEN t1.RXNetworkID IN('00A13','00226') THEN 'Walgreens'
               WHEN t1.RXNetworkID IN('00599','00886','00519') THEN 'Omnicare'
               WHEN t1.RXNetworkID IN('00631','00752','00754','00686') THEN 'Pharmerica'
               WHEN t1.RXNetworkID = '00905' AND t1.SuperPharmacyNetworkID IN('MDLTC','MDLTCS') THEN 'MHA'
               ELSE 'Other'
            END) AS 'GER Pharm'n, 
          t2.ApprovedPriceType AS Instate, 
          t3.ApprovedPriceType AS Code_MAC
      FROM EGTASK.DATA_WITH_MAC_GROUPING t1
           LEFT JOIN EGTASK.'TABLES FOR SAS'n t2 ON (t1.SuperPharmacyNetworkID = t2.SuperPharmacyNetworkID)
           LEFT JOIN EGTASK.'TABLES FOR SAS_0000'n t3 ON (t1.FinalPlanCode = t3.FinalPlanCode);
QUIT;

I'm getting about 2 million observations. If I add DISTINCT to the SELECT clause I still get 2 million observations. However when I run this one (which directly follows the previous one):

PROC SQL;
   CREATE TABLE EGTASK.All_Combined_Table AS 
   SELECT DISTINCT t1.ApprovedPriceType,
          t1.CarrierID, 
          t1.CLIENT,
          t1.FilledMonth, 
          t1.FilledQuarter AS 'Filled Quarter'n, 
          t1.GPINumber, 
          t1.MailOrderIndicator,
          count(distinct(t1.MemberID)) as Lives, 
          t1.MultiSourceCode, 
          t1.PriceScheduleDefinition, 
          t1.RxNetworkID AS RXNetwork,
          t1.SuperPharmacyNetworkID, 
		  /* SUM_OF_ApprovedIngredientCost */
            (SUM(t1.SUM_of_SUM_of_AIC)) FORMAT=DOLLAR16.2 AS SUM_OF_ApprovedIngredientCost, 
          /* SUM_OF_AWP */
            (SUM(t1.SUM_of_SUM_of_AWP)) AS SUM_OF_AWP, 
          /* SUM_OF_RX */
            (SUM(t1.SUM_of_SUM_of_RX)) AS SUM_OF_RX, 
          /* SUM_OF_QUANTITYDISPENSED */
            (SUM(t1.SUM_of_SUM_of_QUANTITYDISPENSED)) FORMAT=13.3 AS SUM_OF_QUANTITYDISPENSED,
          t1.AFF_NAME AS 'AFF Name'n, 
          t1.AFF_TYPE AS 'AFF Type'n, 
          t1.NETWORK, 
          t1.'Business Line'n AS Business_Line, 
          t1.'Analytics Grouping 1'n AS Analytics_Grouping1, 
          t1.'MAC Savings Grouping'n AS MACSavingsGrouping, 
          t1.IVLIndicator,
		  t1.Fund_Type,
		  t1.'GER Pharm'n,
		  t1.Instate,
		  t1.Code_MAC,
		  t1.MAC_List
      FROM EGTASK.DATA_WITH_MAC_GROUPING2 t1
      WHERE t1.NETWORK NOT = 'PAPERMATCH'
      GROUP BY t1.ApprovedPriceType, t1.CarrierID, t1.CLIENT, t1.FilledMonth, t1.FilledQuarter, t1.GPINumber, 
              t1.MailOrderIndicator, t1.MultiSourceCode, t1.PriceScheduleDefinition, t1.RxNetworkID, 
              t1.SuperPharmacyNetworkID, t1.AFF_NAME, t1.AFF_TYPE, t1.NETWORK, t1.'Business Line'n, t1.
              'Analytics Grouping 1'n, t1.'MAC Savings Grouping'n, t1.MAC_List, t1.IVLIndicator, t1.Fund_Type;
QUIT;

I also get 2 million records. However if I add the DISCINT to the SELECT clause on this one (The ALL_COMBINED_TABLE), I get about 800,000 observations. I don't understand why this happens.

Shmuel · Posted 09-16-2016 02:40 PM

On second SQL program you have a GROUP statement.

With DISTINCT you got one obervation per group only.

Does it explain and give an answer to your question ?

View solution in original post

Shmuel · Posted 09-16-2016 02:40 PM

On second SQL program you have a GROUP statement.

With DISTINCT you got one obervation per group only.

Does it explain and give an answer to your question ?

JediApprentice · Posted 09-16-2016 02:50 PM

Then with the DISTINCT on the second one, would I be getting rid of records I'd actually want to keep do you think?

Shmuel · Posted 09-16-2016 03:01 PM

Sorry, I don't understand what you mean by "getting rid of records I'd actually want to keep" ?

Can you give a schematic example of one group input (detailed) and output expected ?

PROC SQL SELECT DISTINCT

Re: PROC SQL SELECT DISTINCT

Re: PROC SQL SELECT DISTINCT

Re: PROC SQL SELECT DISTINCT

Re: PROC SQL SELECT DISTINCT

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away