Good day,
I have a data few data sets of the same information, each table represents a different month. I consolidated the data into one using SAS, and I am trying to track changes for each variable for each person each month. Here is a mock of the data--I created PK combining LIC ID and First and Last Name.
LIC_ID | PK | LIC_Num | LIC_Desc | Last_Name | First_Name | Middle_Name | Business_Name | CountyID | County | Expire_Date | address | City | State | ZipCode |
5001 | 5001DoeJohn | 11111 | Doe | John | E | Family Med | 88 | COOK | 10/1/2017 | 47500 North | Memphis | TN | 45566 | |
5665 | 5665DoeJane | 22222 | VOID | Doe | Jane | M | 44 | Dupage | 12/20/2018 | 817 Terrain | Bensonville | IL | 97887 | |
488 | 488SmithJoe | 33333 | HOLD | Smith | Joe | D | Out of State | 9/1/2017 | NY | 75566 | ||||
5899 | 5899ParkSam | 44444 | Expired | Park | Sam | M | 48 | ALACHUA | 9/30/2017 | 58088 23rd street | Orlando | FL | 84998 | |
5001 | 5001DoeJohn | 11111 | LIC Change | Doe | John | E | Family Med | 88 | COOK | 10/1/2020 | 47500 North | Memphis | TN | 45566 |
5665 | 5665DoeJane | 22222 | VOID | Doe | Jane | M | Corner Clinic | 44 | Dupage | 12/20/2018 | 817 Terrain | Bensonville | IL | 97887 |
488 | 488SmithJoe | 33333 | Smith | Joe | D | 46 | NEW YORK | 1/1/2020 | 342 Washington | New York | NY | 75566 | ||
5899 | 5899ParkSam | 44444 | Expired | Park | Sam | M | 48 | ALACHUA | 9/30/2017 | 58088 23rd street | Orlando | FL | 84998 | |
48889 | 48889LockJohn | 55555 | NEW | Lock | John | Lock Clinic | MIAMI-DADE | 10/25/2019 | Atlanta | GA | 79914 |
I am not sure if it was easier to do this as one table, or keep them as separate. This will be an ongoing analysis and would like to create a macro to do this, but am unsure where to start, or if it is possible. I am looking to have two outputs. One being a SAS dataset which will be the change for each individual for each variable by month, the second being some table using proc tabulate, or proc sgpanel. Basically, in my head the output would be like the below for the sgpanel:
Column1 | JAN | FEB | MAR | APR | MAY | JUN | JUL | AUG | SEP | OCT | NOV | DEC |
LIC_ID | 0% | 10% | 0.05% | 2% | 0% | 0% | 5% | 25% | 1% | 0% | 0% | 2% |
PK | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
LIC_NUM | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
LIC_DESC | 1% | 0% | 5% | 0% | 10% | 2% | 0% | 0% | 1.50% | 1% | 0% | 2% |
LAST_NAME | 0% | 0.10% | 0% | 0.01% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% |
FIRST_NAME | 0% | 0% | 0% | 0% | 0% | 0.01% | 0% | 0% | 0% | 0% | 0% | 0% |
MIDDLE_NAME | 0% | 0% | 0% | 0% | 0% | 0.01% | 0% | 0% | 0% | 0% | 0% | 0% |
BUSINESS_NAME | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
COUNTY_ID | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
COUNTYID | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
COUNTY | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
EXPIRE_DATE | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
ADDRESS | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
CITY | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
STATE | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
ZIPCODE | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. | etc. |
Any help is much appreciated, I have been struggling trying to score categorical data over time as well as trying to figure out how to create a data set that would show here is the specific records, over this many months, this specific variable changed this many times, and it show at which month it changed. Thank you again! (The percentages do not correlate to the original table, just used for visual purposes).
What about something along the line of below?
data have;
infile datalines dlm='|' dsd truncover;
input LICID report_dt:date9. (Last_Name LN_CharCT LIC_Desc LIC_desc_CharCT) (:$30.);
format report_dt date9.;
datalines;
111|01May2017|Doe|3|Expired|7
111|01Jun2017|Smith|5|Renewed|7
222|01May2017|Parc|4|VOID|4
222|01Jun2017|Park|4|VOID|4
;
run;
proc sql;
create table inter as
select
l.licid,
l.report_dt,
case
when (not missing(r.licid)) then compged(r.Last_Name, l.Last_Name)
else .
end as dif_Last_Name,
case
when (not missing(r.licid)) then compged(r.LN_CharCT, l.LN_CharCT)
else .
end as dif_LN_CharCT,
case
when (not missing(r.LIC_Desc)) then compged(r.LIC_Desc, l.LIC_Desc)
else .
end as dif_LIC_Desc,
case
when (not missing(r.LIC_desc_CharCT)) then compged(r.LIC_desc_CharCT, l.LIC_desc_CharCT)
else .
end as dif_LIC_desc_CharCT
from
have l left join have r
on l.licid=r.licid and r.report_dt=intnx('month',l.report_dt,-1,'b')
order by l.licid, l.report_dt
;
quit;
proc transpose data=inter out=want(rename=(_name_=Var_Name));
by licid;
id report_dt;
format report_dt monyy5.;
var dif_Last_Name dif_LN_CharCT dif_LIC_Desc dif_LIC_desc_CharCT
;
run;
Whatever you are going to do it will probably work out much easier if you have a single record for each individual and point in time. Since you say this is "on going", which implies to me that you will be adding records at later dates, that point in time variable should be a SAS Date valued variable as it is much easier to calculate durations and intervals from SAS date values than from "May".
Here is an example using a SAS supplied data set you should have available. The first step summarized a daily data set to monthly and then the differences for a couple of variables are calculated from the monthly summaries. Then Tabulate is used to display them.
proc summary data=sashelp.snacks nway; class product date; Format date Monyy7.; var QtySold Price; output out=work.snacksum (drop= _:) sum(QtySold)=MonthSale mean(Price)=MonthMeanPrice; run; data work.difference; set work.snacksum; by product date; QuantityDif = dif(MonthSale); PriceDif = round( dif(MonthMeanPrice), 0.01); if first.product then do; QuantityDif =0; PriceDif=0; end; run; proc tabulate data=work.difference; class product date; var QuantityDif PriceDif; tables product *(QuantityDif PriceDif), date * max=''*f=best8.; label QuantityDif='Units sold difference' priceDif ='Difference mean monthly price' ; run;
If you add data to the starting data set, such as sashelp.snacks then the intermediate steps will summarize using the date value and then tabulate (or proc report) shows the results including the new data. No macro needed. The Product would be replaced by your individual identifier. Notice that the report table handles the display of the monthly differences in a "nice" manner: sorted by value which does not happen with values like "May" and "August" as text varaibles.
I do wonder what you may have been thinking about % for Last name though???
That definitely makes sense! Thank you for the reply!
That is the piece I am struggling with, I want to almost do a letter by letter macro analysis for the characters to say for instance below within a matrix style table:
LICID Month Last Name LN_CharCT LIC_Desc LIC_desc_CharCT
111 May Doe 3 Expired 7
111 June Smith 5 Renewed 7
222 May Parc 4 VOID 4
222 May Park 4 VOID 4
Matrix:
Month
LIC ID VAR .,.. MAY JUNE
111 Last_Name 0% 160 %
LIC_Desc 0% 100%
222 Last_Name 0% 25%
LIC_Desc 0% 0%
Even if the percentage is a stretch, or an exaggerated solution, it would be nice to see then instead of the percentage in the matix simply [changed, no_changed]. These will be monthly files that really what I care about is this last name changed from May to June, but the ability to show this in a standardized manner for both qualitative and quantitative pieces. I have no problem with the change/no_change in the table but I would like it for both pieces of data.
@khoffmann wrote:
That definitely makes sense! Thank you for the reply!
That is the piece I am struggling with, I want to almost do a letter by letter macro analysis for the characters to say for instance below within a matrix style table:
LICID Month Last Name LN_CharCT LIC_Desc LIC_desc_CharCT
111 May Doe 3 Expired 7
111 June Smith 5 Renewed 7
222 May Parc 4 VOID 4
222 May Park 4 VOID 4
Matrix:
Month
LIC ID VAR .,.. MAY JUNE
111 Last_Name 0% 160 %
LIC_Desc 0% 100%
222 Last_Name 0% 25%
LIC_Desc 0% 0%
Even if the percentage is a stretch, or an exaggerated solution, it would be nice to see then instead of the percentage in the matix simply [changed, no_changed]. These will be monthly files that really what I care about is this last name changed from May to June, but the ability to show this in a standardized manner for both qualitative and quantitative pieces. I have no problem with the change/no_change in the table but I would like it for both pieces of data.
If you are looking for differences in spelling there are a number of functions that will return a numeric "distance" between words.
Compged and Complev are two of them.
In a data set sorted as you need it you could use
LastNameDif = compged(lag(Lastname), Lastname));
With a corresponding: If first.personid then LastNameDif=0.
If the names are identical then compged will return 0. Small differences "Jones" vs "Jone" will have small values for the result (50 in this case) "Jones" and "Freddy" have a compged value of 610.
So you could see successive changes in a magnitude sense. You can customize the results by using the function Call COMPCOST to set how much importance you see for a deleted/inserted letter or punctuation (set low if you want "ABC Company, Inc" to match "ABC Company Inc" for instance.
What about something along the line of below?
data have;
infile datalines dlm='|' dsd truncover;
input LICID report_dt:date9. (Last_Name LN_CharCT LIC_Desc LIC_desc_CharCT) (:$30.);
format report_dt date9.;
datalines;
111|01May2017|Doe|3|Expired|7
111|01Jun2017|Smith|5|Renewed|7
222|01May2017|Parc|4|VOID|4
222|01Jun2017|Park|4|VOID|4
;
run;
proc sql;
create table inter as
select
l.licid,
l.report_dt,
case
when (not missing(r.licid)) then compged(r.Last_Name, l.Last_Name)
else .
end as dif_Last_Name,
case
when (not missing(r.licid)) then compged(r.LN_CharCT, l.LN_CharCT)
else .
end as dif_LN_CharCT,
case
when (not missing(r.LIC_Desc)) then compged(r.LIC_Desc, l.LIC_Desc)
else .
end as dif_LIC_Desc,
case
when (not missing(r.LIC_desc_CharCT)) then compged(r.LIC_desc_CharCT, l.LIC_desc_CharCT)
else .
end as dif_LIC_desc_CharCT
from
have l left join have r
on l.licid=r.licid and r.report_dt=intnx('month',l.report_dt,-1,'b')
order by l.licid, l.report_dt
;
quit;
proc transpose data=inter out=want(rename=(_name_=Var_Name));
by licid;
id report_dt;
format report_dt monyy5.;
var dif_Last_Name dif_LN_CharCT dif_LIC_Desc dif_LIC_desc_CharCT
;
run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.