DATA Step, Macro, Functions and more

Need help with duplicate processing

Accepted Solution Solved
Reply
Contributor
Posts: 35
Accepted Solution

Need help with duplicate processing

Hello everyone,

 

I have a dataset like the following, the 1st letter in first column can be 'H'(header) or Blank. When it is header record would like to get the columns from 8 to 13(6 chars-896506) say router# and the datetime from 44 to 53(10 chars-1704171513) say dttm,  when it is blank like in the second line  would like get columns from 2nd to 11(9 chars) say acct#. Router# will be same for all the acct#s until the another header is found. 

 

INPUT:

 

H1704158965068107143934AAAAA        MAAAAA300A1704171513
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
   073676756000000502479917040100019167574+00000063989+00000029219+ 0000000
   074974084000000502479717040100012097496+00000036608+00000021058+ 0000000

H1704158965068107143934AAAAA        MAAAAA300A1704171717
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
   074985264000000502487917040100021946562+00000070979+00000034166+ 0000000

H1704158965068107143934AAAAA        MAAAAA300A1704171809
   079814999000000502485617040100012716909+00000042454+00000019288+ 0000000  

   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
  

 

output:

router1#acct1#dttm1

router1#acct2#dttm1

router1#acct3#dttm1

router1#acct1#dttm2

router1#acct1#dttm3

 

Then would like to eliminate the duplicate records based on the timestamp which ever is the latest.

 

output 2:

router1#acct1#dttm3(assuming dttm3 is the latest one)

router1#acct2#dttm3

router1#acct3#dttm3

 

Afte the duplicates are eliminated, would like to have the original input file without the duplicates for further batch process.

I really appreciate your help.

 

Thank you

Neal. 


Accepted Solutions
Solution
‎05-09-2017 02:01 PM
Super User
Posts: 11,335

Re: Need help with duplicate processing

Posted in reply to saslovethemost

When posting text data you should post it in a code box opened using the forum {i} icon. The main message window may format text spaces/ tabs. For instance when I look at colum 44 to 53 on your first line I see "00A170417" when pasting into my editor.

 

Adjust the column numbers in the code to match your input file. Use an INFILE statement to point to your data instead of the datalines used below.

 

data read;
   informat router $6. date yymmdd6. hour minute f2. acct $9.;
   input @;
   if _infile_=:'H' then do;
      input router 8-13 date 47-52 hour 53-54 minute 55-56;
      dttm = dhms(date,hour,minute,0);
   end;
   else input acct 2-11;
   format dttm datetime16.;
   retain router dttm;
   if not missing(acct);
   keep router acct dttm;
datalines;
H1704158965068107143934AAAAA        MAAAAA300A1704171513
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
   073676756000000502479917040100019167574+00000063989+00000029219+ 0000000
   074974084000000502479717040100012097496+00000036608+00000021058+ 0000000
H1704158965068107143934AAAAA        MAAAAA300A1704171717
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
   074985264000000502487917040100021946562+00000070979+00000034166+ 0000000
H1704158965068107143934AAAAA        MAAAAA300A1704171809
   079814999000000502485617040100012716909+00000042454+00000019288+ 0000000  
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
;
run;

proc sort data=read;
  by router acct dttm;
run;

data want;
  set read;
  by router acct dttm;
  if last.acct;
run;


You will have to explain further what you may mean by "the original input file without the duplicates" as that sounds like you may want data that was present but not read.

 

View solution in original post


All Replies
Solution
‎05-09-2017 02:01 PM
Super User
Posts: 11,335

Re: Need help with duplicate processing

Posted in reply to saslovethemost

When posting text data you should post it in a code box opened using the forum {i} icon. The main message window may format text spaces/ tabs. For instance when I look at colum 44 to 53 on your first line I see "00A170417" when pasting into my editor.

 

Adjust the column numbers in the code to match your input file. Use an INFILE statement to point to your data instead of the datalines used below.

 

data read;
   informat router $6. date yymmdd6. hour minute f2. acct $9.;
   input @;
   if _infile_=:'H' then do;
      input router 8-13 date 47-52 hour 53-54 minute 55-56;
      dttm = dhms(date,hour,minute,0);
   end;
   else input acct 2-11;
   format dttm datetime16.;
   retain router dttm;
   if not missing(acct);
   keep router acct dttm;
datalines;
H1704158965068107143934AAAAA        MAAAAA300A1704171513
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
   073676756000000502479917040100019167574+00000063989+00000029219+ 0000000
   074974084000000502479717040100012097496+00000036608+00000021058+ 0000000
H1704158965068107143934AAAAA        MAAAAA300A1704171717
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
   074985264000000502487917040100021946562+00000070979+00000034166+ 0000000
H1704158965068107143934AAAAA        MAAAAA300A1704171809
   079814999000000502485617040100012716909+00000042454+00000019288+ 0000000  
   073340146000000502484817040100010032171+00000032446+00000015623+ 0000000
;
run;

proc sort data=read;
  by router acct dttm;
run;

data want;
  set read;
  by router acct dttm;
  if last.acct;
run;


You will have to explain further what you may mean by "the original input file without the duplicates" as that sounds like you may want data that was present but not read.

 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 1 reply
  • 122 views
  • 0 likes
  • 2 in conversation