Solved: Compare categorical data over months at individual level the change, ...

khoffmann · Posted 10-13-2017 10:31 AM

Good day,

I have a data few data sets of the same information, each table represents a different month. I consolidated the data into one using SAS, and I am trying to track changes for each variable for each person each month. Here is a mock of the data--I created PK combining LIC ID and First and Last Name.

LIC_ID

PK

LIC_Num

LIC_Desc

Last_Name

First_Name

Middle_Name

Business_Name

CountyID

County

Expire_Date

address

City

State

ZipCode

5001

5001DoeJohn

11111

Doe

John

E

Family Med

88

COOK

10/1/2017

47500 North

Memphis

TN

45566

5665

5665DoeJane

22222

VOID

Doe

Jane

M

44

Dupage

12/20/2018

817 Terrain

Bensonville

IL

97887

488

488SmithJoe

33333

HOLD

Smith

Joe

D

Out of State

9/1/2017

NY

75566

5899

5899ParkSam

44444

Expired

Park

Sam

M

48

ALACHUA

9/30/2017

58088 23rd street

Orlando

FL

84998

5001

5001DoeJohn

11111

LIC Change

Doe

John

E

Family Med

88

COOK

10/1/2020

47500 North

Memphis

TN

45566

5665

5665DoeJane

22222

VOID

Doe

Jane

M

Corner Clinic

44

Dupage

12/20/2018

817 Terrain

Bensonville

IL

97887

488

488SmithJoe

33333

Smith

Joe

D

46

NEW YORK

1/1/2020

342 Washington

New York

NY

75566

5899

5899ParkSam

44444

Expired

Park

Sam

M

48

ALACHUA

9/30/2017

58088 23rd street

Orlando

FL

84998

48889

48889LockJohn

55555

NEW

Lock

John

Lock Clinic

MIAMI-DADE

10/25/2019

Atlanta

GA

79914

I am not sure if it was easier to do this as one table, or keep them as separate. This will be an ongoing analysis and would like to create a macro to do this, but am unsure where to start, or if it is possible. I am looking to have two outputs. One being a SAS dataset which will be the change for each individual for each variable by month, the second being some table using proc tabulate, or proc sgpanel. Basically, in my head the output would be like the below for the sgpanel:

Column1

JAN

FEB

MAR

APR

MAY

JUN

JUL

AUG

SEP

OCT

NOV

DEC

LIC_ID

0%

10%

0.05%

2%

0%

5%

25%

1%

0%

2%

PK

0%

LIC_NUM

0%

LIC_DESC

1%

0%

5%

0%

10%

2%

0%

1.50%

1%

0%

2%

LAST_NAME

0%

0.10%

0%

0.01%

0%

FIRST_NAME

0%

0.01%

0%

MIDDLE_NAME

0%

0.01%

0%

BUSINESS_NAME

etc.

COUNTY_ID

etc.

COUNTYID

etc.

COUNTY

etc.

EXPIRE_DATE

etc.

ADDRESS

etc.

CITY

etc.

STATE

etc.

ZIPCODE

etc.

Any help is much appreciated, I have been struggling trying to score categorical data over time as well as trying to figure out how to create a data set that would show here is the specific records, over this many months, this specific variable changed this many times, and it show at which month it changed. Thank you again! (The percentages do not correlate to the original table, just used for visual purposes).

Patrick · Posted 10-14-2017 12:16 AM

@khoffmann

What about something along the line of below?

data have;
  infile datalines dlm='|' dsd truncover;
  input LICID report_dt:date9. (Last_Name LN_CharCT LIC_Desc LIC_desc_CharCT) (:$30.);
  format report_dt date9.;
  datalines;
111|01May2017|Doe|3|Expired|7
111|01Jun2017|Smith|5|Renewed|7
222|01May2017|Parc|4|VOID|4
222|01Jun2017|Park|4|VOID|4
;
run;

proc sql;
  create table inter as
  select
    l.licid,
    l.report_dt,
    case
      when (not missing(r.licid)) then compged(r.Last_Name, l.Last_Name) 
      else .
      end as dif_Last_Name,
    case
      when (not missing(r.licid)) then compged(r.LN_CharCT, l.LN_CharCT) 
      else .
      end as dif_LN_CharCT,
    case
      when (not missing(r.LIC_Desc)) then compged(r.LIC_Desc, l.LIC_Desc) 
      else .
      end as dif_LIC_Desc,
    case
      when (not missing(r.LIC_desc_CharCT)) then compged(r.LIC_desc_CharCT, l.LIC_desc_CharCT) 
      else .
      end as dif_LIC_desc_CharCT
  from 
    have l left join have r
      on l.licid=r.licid and r.report_dt=intnx('month',l.report_dt,-1,'b')
  order by l.licid, l.report_dt 
  ;
quit;

proc transpose data=inter out=want(rename=(_name_=Var_Name));
  by licid;
  id report_dt;
  format report_dt monyy5.;
  var dif_Last_Name dif_LN_CharCT dif_LIC_Desc dif_LIC_desc_CharCT
  ;
run;

View solution in original post

ballardw · Posted 10-13-2017 11:40 AM

Whatever you are going to do it will probably work out much easier if you have a single record for each individual and point in time. Since you say this is "on going", which implies to me that you will be adding records at later dates, that point in time variable should be a SAS Date valued variable as it is much easier to calculate durations and intervals from SAS date values than from "May".

Here is an example using a SAS supplied data set you should have available. The first step summarized a daily data set to monthly and then the differences for a couple of variables are calculated from the monthly summaries. Then Tabulate is used to display them.

proc summary data=sashelp.snacks nway;
   class product date;
   Format date Monyy7.;
   var QtySold Price;
   output out=work.snacksum (drop= _:) sum(QtySold)=MonthSale mean(Price)=MonthMeanPrice;
run;

data work.difference;
   set work.snacksum;
   by product date;

   QuantityDif = dif(MonthSale);
   PriceDif    = round( dif(MonthMeanPrice), 0.01);
   if first.product then do;
      QuantityDif =0;
      PriceDif=0;
   end;
   
run;

proc tabulate data=work.difference;
   class product date;
   var QuantityDif PriceDif;
   tables product *(QuantityDif PriceDif),
          date * max=''*f=best8.;
   label
      QuantityDif='Units sold difference'
      priceDif   ='Difference mean monthly price'
   ;
run;

If you add data to the starting data set, such as sashelp.snacks then the intermediate steps will summarize using the date value and then tabulate (or proc report) shows the results including the new data. No macro needed. The Product would be replaced by your individual identifier. Notice that the report table handles the display of the monthly differences in a "nice" manner: sorted by value which does not happen with values like "May" and "August" as text varaibles.

I do wonder what you may have been thinking about % for Last name though???

khoffmann · Posted 10-13-2017 12:21 PM

That definitely makes sense! Thank you for the reply!

That is the piece I am struggling with, I want to almost do a letter by letter macro analysis for the characters to say for instance below within a matrix style table:

LICID Month Last Name LN_CharCT LIC_Desc LIC_desc_CharCT

111 May Doe 3 Expired 7

111 June Smith 5 Renewed 7

222 May Parc 4 VOID 4

222 May Park 4 VOID 4

Matrix:

Month

LIC ID VAR .,.. MAY JUNE

111 Last_Name 0% 160 %

LIC_Desc 0% 100%

222 Last_Name 0% 25%

LIC_Desc 0% 0%

Even if the percentage is a stretch, or an exaggerated solution, it would be nice to see then instead of the percentage in the matix simply [changed, no_changed]. These will be monthly files that really what I care about is this last name changed from May to June, but the ability to show this in a standardized manner for both qualitative and quantitative pieces. I have no problem with the change/no_change in the table but I would like it for both pieces of data.

ballardw · Posted 10-13-2017 04:54 PM

@khoffmann wrote:

That definitely makes sense! Thank you for the reply!

That is the piece I am struggling with, I want to almost do a letter by letter macro analysis for the characters to say for instance below within a matrix style table:

LICID Month Last Name LN_CharCT LIC_Desc LIC_desc_CharCT

111 May Doe 3 Expired 7

111 June Smith 5 Renewed 7

222 May Parc 4 VOID 4

222 May Park 4 VOID 4

Matrix:

Month

LIC ID VAR .,.. MAY JUNE

111 Last_Name 0% 160 %

LIC_Desc 0% 100%

222   Last_Name 0% 25%

LIC_Desc   0% 0%

Even if the percentage is a stretch, or an exaggerated solution, it would be nice to see then instead of the percentage in the matix simply [changed, no_changed]. These will be monthly files that really what I care about is this last name changed from May to June, but the ability to show this in a standardized manner for both qualitative and quantitative pieces. I have no problem with the change/no_change in the table but I would like it for both pieces of data.

If you are looking for differences in spelling there are a number of functions that will return a numeric "distance" between words.

Compged and Complev are two of them.

In a data set sorted as you need it you could use

LastNameDif = compged(lag(Lastname), Lastname));

With a corresponding: If first.personid then LastNameDif=0.

If the names are identical then compged will return 0. Small differences "Jones" vs "Jone" will have small values for the result (50 in this case) "Jones" and "Freddy" have a compged value of 610.

So you could see successive changes in a magnitude sense. You can customize the results by using the function Call COMPCOST to set how much importance you see for a deleted/inserted letter or punctuation (set low if you want "ABC Company, Inc" to match "ABC Company Inc" for instance.

Patrick · Posted 10-14-2017 12:16 AM

@khoffmann

What about something along the line of below?

data have;
  infile datalines dlm='|' dsd truncover;
  input LICID report_dt:date9. (Last_Name LN_CharCT LIC_Desc LIC_desc_CharCT) (:$30.);
  format report_dt date9.;
  datalines;
111|01May2017|Doe|3|Expired|7
111|01Jun2017|Smith|5|Renewed|7
222|01May2017|Parc|4|VOID|4
222|01Jun2017|Park|4|VOID|4
;
run;

proc sql;
  create table inter as
  select
    l.licid,
    l.report_dt,
    case
      when (not missing(r.licid)) then compged(r.Last_Name, l.Last_Name) 
      else .
      end as dif_Last_Name,
    case
      when (not missing(r.licid)) then compged(r.LN_CharCT, l.LN_CharCT) 
      else .
      end as dif_LN_CharCT,
    case
      when (not missing(r.LIC_Desc)) then compged(r.LIC_Desc, l.LIC_Desc) 
      else .
      end as dif_LIC_Desc,
    case
      when (not missing(r.LIC_desc_CharCT)) then compged(r.LIC_desc_CharCT, l.LIC_desc_CharCT) 
      else .
      end as dif_LIC_desc_CharCT
  from 
    have l left join have r
      on l.licid=r.licid and r.report_dt=intnx('month',l.report_dt,-1,'b')
  order by l.licid, l.report_dt 
  ;
quit;

proc transpose data=inter out=want(rename=(_name_=Var_Name));
  by licid;
  id report_dt;
  format report_dt monyy5.;
  var dif_Last_Name dif_LN_CharCT dif_LIC_Desc dif_LIC_desc_CharCT
  ;
run;

khoffmann · Posted 10-17-2017 09:42 AM

Thank you Patrick!

Compare categorical data over months at individual level the change, and overall % change for vars

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Compare categorical data over months at individual level the change, and overall % change for vars

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

Re: Compare categorical data over months at individual level the change, and overall % change for v

2025 SAS Hackathon: There is still time!

SAS Training: Just a Click Away