I can't post sample data because these datasets are extremely large. This is the problem I'm running into. When I run this query:
PROC SQL;
CREATE TABLE EGTASK.DATA_WITH_MAC_GROUPING2 AS
SELECT t1.AffiliationID,
t1.ApprovedPriceType,
t1.BRAND_GENERIC,
t1.CarrierID,
t1.ClaimOriginationFlag,
t1.ClaimStatus,
t1.CLIENT,
t1.CostTypeCode,
t1.DISPENSEASWRITTENDAWPRODUCTSELE,
t1.ExtendedSupplyNetworkIndicator,
t1.FilledMonth,
t1.FilledQuarter,
t1.FinalPlanCode,
t1.GPINumber,
t1.MailOrderIndicator,
t1.MemberID,
t1.MultiSourceCode,
t1.PlanDrugStatus,
t1.PriceScheduleDefinition,
t1.RxNetworkID,
t1.RXNetworkID_CHAIN,
t1.SERVICEPROVIDERID,
t1.Specialty_Flag,
t1.SuperPharmacyNetworkID,
t1.ZIP,
t1.SUM_of_ApprovedIngredientCost,
t1.SUM_of_SUM_of_AIC,
t1.SUM_of_SUM_of_AWP,
t1.SUM_of_SUM_of_RX,
t1.SUM_of_SUM_of_QUANTITYDISPENSED,
t1.AFF_NAME,
t1.AFF_TYPE,
t1.NETWORK,
t1.'Business Line'n,
t1.'Analytics Grouping 1'n,
t1.'MAC Savings Grouping'n,
t1.IVLIndicator,
t1.Fund_Type,
/* GER Pharm */
(CASE
WHEN t1.RXNetworkID IN('00008','00177','00608','00782','00039','00123','00207') THEN 'CVS'
WHEN t1.RXNetworkID IN('00A13','00226') THEN 'Walgreens'
WHEN t1.RXNetworkID IN('00599','00886','00519') THEN 'Omnicare'
WHEN t1.RXNetworkID IN('00631','00752','00754','00686') THEN 'Pharmerica'
WHEN t1.RXNetworkID = '00905' AND t1.SuperPharmacyNetworkID IN('MDLTC','MDLTCS') THEN 'MHA'
ELSE 'Other'
END) AS 'GER Pharm'n,
t2.ApprovedPriceType AS Instate,
t3.ApprovedPriceType AS Code_MAC
FROM EGTASK.DATA_WITH_MAC_GROUPING t1
LEFT JOIN EGTASK.'TABLES FOR SAS'n t2 ON (t1.SuperPharmacyNetworkID = t2.SuperPharmacyNetworkID)
LEFT JOIN EGTASK.'TABLES FOR SAS_0000'n t3 ON (t1.FinalPlanCode = t3.FinalPlanCode);
QUIT;
I'm getting about 2 million observations. If I add DISTINCT to the SELECT clause I still get 2 million observations. However when I run this one (which directly follows the previous one):
PROC SQL;
CREATE TABLE EGTASK.All_Combined_Table AS
SELECT DISTINCT t1.ApprovedPriceType,
t1.CarrierID,
t1.CLIENT,
t1.FilledMonth,
t1.FilledQuarter AS 'Filled Quarter'n,
t1.GPINumber,
t1.MailOrderIndicator,
count(distinct(t1.MemberID)) as Lives,
t1.MultiSourceCode,
t1.PriceScheduleDefinition,
t1.RxNetworkID AS RXNetwork,
t1.SuperPharmacyNetworkID,
/* SUM_OF_ApprovedIngredientCost */
(SUM(t1.SUM_of_SUM_of_AIC)) FORMAT=DOLLAR16.2 AS SUM_OF_ApprovedIngredientCost,
/* SUM_OF_AWP */
(SUM(t1.SUM_of_SUM_of_AWP)) AS SUM_OF_AWP,
/* SUM_OF_RX */
(SUM(t1.SUM_of_SUM_of_RX)) AS SUM_OF_RX,
/* SUM_OF_QUANTITYDISPENSED */
(SUM(t1.SUM_of_SUM_of_QUANTITYDISPENSED)) FORMAT=13.3 AS SUM_OF_QUANTITYDISPENSED,
t1.AFF_NAME AS 'AFF Name'n,
t1.AFF_TYPE AS 'AFF Type'n,
t1.NETWORK,
t1.'Business Line'n AS Business_Line,
t1.'Analytics Grouping 1'n AS Analytics_Grouping1,
t1.'MAC Savings Grouping'n AS MACSavingsGrouping,
t1.IVLIndicator,
t1.Fund_Type,
t1.'GER Pharm'n,
t1.Instate,
t1.Code_MAC,
t1.MAC_List
FROM EGTASK.DATA_WITH_MAC_GROUPING2 t1
WHERE t1.NETWORK NOT = 'PAPERMATCH'
GROUP BY t1.ApprovedPriceType, t1.CarrierID, t1.CLIENT, t1.FilledMonth, t1.FilledQuarter, t1.GPINumber,
t1.MailOrderIndicator, t1.MultiSourceCode, t1.PriceScheduleDefinition, t1.RxNetworkID,
t1.SuperPharmacyNetworkID, t1.AFF_NAME, t1.AFF_TYPE, t1.NETWORK, t1.'Business Line'n, t1.
'Analytics Grouping 1'n, t1.'MAC Savings Grouping'n, t1.MAC_List, t1.IVLIndicator, t1.Fund_Type;
QUIT;
I also get 2 million records. However if I add the DISCINT to the SELECT clause on this one (The ALL_COMBINED_TABLE), I get about 800,000 observations. I don't understand why this happens.
On second SQL program you have a GROUP statement.
With DISTINCT you got one obervation per group only.
Does it explain and give an answer to your question ?
On second SQL program you have a GROUP statement.
With DISTINCT you got one obervation per group only.
Does it explain and give an answer to your question ?
Then with the DISTINCT on the second one, would I be getting rid of records I'd actually want to keep do you think?
Sorry, I don't understand what you mean by "getting rid of records I'd actually want to keep" ?
Can you give a schematic example of one group input (detailed) and output expected ?
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.