DATA Step, Macro, Functions and more

Data Cleaning - Character String

Occasional Contributor
Posts: 16

Data Cleaning - Character String

Hi all, 


I have issues on my data.

What I am trying to do is data cleaning and the data I received is a raw file. 


The example for the data I am working is as below:

INPUT @1 Ref 7. 
	  @9 ID 5. 
	  @14 LockerNo 12.
	  @28 Date ddmmyy10.
	  @39 Cust_Name $15.
format date ddmmyy10.;
4511445 1952  123456789123 08/12/2001 XYZ PRODUCTION 
4545154 1952  987654321987 26/09/2001 DEF Co.        
7895412 96321 456123789456 12/10/2000 PINK Enterprise
7895412 96321 123789654753 21/12/2000 GbC AGENCY     
5451654 96321 125874934589 05/10/2006 ABC AGENCY     
5451654 96321 123467984352 23/11/2004 ABC AGENCu     
5451654 96321 785645464644 17/01/2005 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 ABC AGENCY     

What I am trying to achieve is:

# Note that Everything will be grouped by ID. 

1) If the first 5 characters of Cust_Name or the last 5 characters of Cust_Name is the same, then a new variable, Final will have the value of Ref with latest Date


Below is the table I am trying to achieve:

451144519528/12/2001XYZ PRODUCTION4511445
4545154195226/9/2001DEF Co.4545154
78954129632112/10/2000PINK Enterprise7895412
78954129632121/12/2000GbC AGENCY5451654
5451654963215/10/2006ABC AGENCY5451654
54516549632123/11/2004ABC AGENCu5451654
54516549632117/1/2005ABC AGENCY5451654
54516549632121/12/2000ABC AGENCY5451654


Cust_Name is a character string and might have many patterns, so after looking at the data, some might have either for the first 5 characters or the last 5 characters. 




Btw, I am using SAS Enterprise Guide 7.1

Esteemed Advisor
Posts: 5,628

Re: Data Cleaning - Character String

Posted in reply to AieuYuhara

What is Final for this data?


ref ID Cust_Name Final
1 1 12345ABCDE ?
2 1 12345ABCDF ?
3 1 12346ABCDF ?

Cust_name from obs 1 and 2 share the first 5 chars and obs 2 and 3 share the last five chars.

Occasional Contributor
Posts: 16

Re: Data Cleaning - Character String

Hi PG,


Final will have the value which parent it should refer to. (The parent should refer to the Ref which have the latest date)


Cust_name (last column) from observation 1 and 2 does not share the first five characters. as one is XYZ Production and PINK DEF Co. Since they do not share or have same 5 character for the first or last, it have their own ref as parent.


This is the code to get the variable Final. But the problem is they read cust_name seperately even though they just differs 2/3 characters.


proc sort data=temp;
by id cust_name descending date;

data want;
set temp;
by id cust_name;
retain final;
if first.cust_name then final = ref;

proc sort data=want;
by id cust_name date;

proc print data=want noobs;

after you run the above code, what i want is observation 7 to copy 5451654 for Final. 
why? because they are in the same ID, only some of the characters differs.


Now I am looking for some codes which might help to group the data even though some of the customer_name differs one/two characters.


Appreciate your help!

Ask a Question
Discussion stats
  • 2 replies
  • 2 in conversation