BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AieuYuhara
Fluorite | Level 6

Hi all!

 

I am incredibly new to SAS Codes, and you guys did help me much before.

 

And now i still need more help from the experts.

Below are the codes for the data:

 

DATA TEMP;
INPUT @1 Ref 7. 
	  @9 ID 5. 
	  @14 LockerNo 12.
	  @28 Date ddmmyy10.
	  @39 Cust_Name $ 14.;
DATALINES;
4511445 1952  123456789123 08/12/2001 XYZ PRODUCTION 
4545154 1952  987654321987 26/09/2001 DEF Co.        
7895412 96321 456123789456 12/10/2000 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 ABC AGENCY     
5451654 96321 125874934589 05/10/2006 ABC AGENCY     
5451654 96321 123467984352 23/11/2004 ABC AGENCY     
5451654 96321 785645464644 17/01/2005 ABC AGENCY     
5451654 96321 156454654546 11/11/2004 GHI Enterprise    
;
RUN; 

proc sort data=have out=have;
  by id DESCENDING date;
run;

data want;
  set have;
  by id;
  retain final;

  if first.id then
    final=ref;
run;

 

What I am trying to do: (All should be group by ID)

- I want variable FINAL to have its own Ref.

- If ID and CUST_NAME is the same, FINAL will equal to the value in Ref with the latest Date

- If ID is the same with the other records, but totally different Cust_Name, then FINAL will be have Ref value respective to its own Cust_Name that have the latest date

 

 

I am trying to achieve as below:

RefIDLockerNoDateCust_NameFinal
451144519521234567891238/12/2001XYZ PRODUCTION4511445
4545154195298765432198726/9/2001DEF Co.4545154
78954129632145612378945612/10/2000ABC AGENCY5451654
78954129632112378965475321/12/2000PINK Enterprise7895412
5451654963211258749345895/10/2006ABC AGENCY5451654
54516549632112346798435223/11/2004ABC AGENCY5451654
54516549632178564546464417/1/2005ABC AGENCY5451654
54516549632115645465454611/11/2004ABC AGENCY5451654

 

 

More clarification Example based on above table:

1)     Cust_Name of PINK Enterprise in Observation 4, shares the same ID as ABC AGENCY, but have different Ref, and diff Cust_Name, so Final will have the value of 7895412, instead of 5451654 (Follows the Ref for latest date of ABC Agency)

2)     Observation 2 have the same case as 1)

 

Really appreciate your help on this!

 

Thank you,

Aieu

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

Sort, but sort the date in reverse order. Then you can keep teh ref from the first.cust_name:

DATA TEMP;
INPUT @1 Ref 7. 
	  @9 ID 5. 
	  @14 LockerNo 12.
	  @28 Date ddmmyy10.
	  @39 Cust_Name $15.
;
format date ddmmyy10.;
DATALINES;
4511445 1952  123456789123 08/12/2001 XYZ PRODUCTION 
4545154 1952  987654321987 26/09/2001 DEF Co.        
7895412 96321 456123789456 12/10/2000 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 ABC AGENCY     
5451654 96321 125874934589 05/10/2006 ABC AGENCY     
5451654 96321 123467984352 23/11/2004 ABC AGENCY     
5451654 96321 785645464644 17/01/2005 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 PINK Enterprise
;
RUN; 

proc sort data=temp;
by id cust_name descending date;
run;

data want;
set temp;
by id cust_name;
retain final;
if first.cust_name then final = ref;
run;

proc sort data=want;
by id cust_name date;
run;

proc print data=want noobs;
run;

Note that I slightly changed your example data, as it did not contain the observarion for PINK Enterprise.

View solution in original post

7 REPLIES 7
Kurt_Bremser
Super User

Sort, but sort the date in reverse order. Then you can keep teh ref from the first.cust_name:

DATA TEMP;
INPUT @1 Ref 7. 
	  @9 ID 5. 
	  @14 LockerNo 12.
	  @28 Date ddmmyy10.
	  @39 Cust_Name $15.
;
format date ddmmyy10.;
DATALINES;
4511445 1952  123456789123 08/12/2001 XYZ PRODUCTION 
4545154 1952  987654321987 26/09/2001 DEF Co.        
7895412 96321 456123789456 12/10/2000 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 ABC AGENCY     
5451654 96321 125874934589 05/10/2006 ABC AGENCY     
5451654 96321 123467984352 23/11/2004 ABC AGENCY     
5451654 96321 785645464644 17/01/2005 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 PINK Enterprise
;
RUN; 

proc sort data=temp;
by id cust_name descending date;
run;

data want;
set temp;
by id cust_name;
retain final;
if first.cust_name then final = ref;
run;

proc sort data=want;
by id cust_name date;
run;

proc print data=want noobs;
run;

Note that I slightly changed your example data, as it did not contain the observarion for PINK Enterprise.

AieuYuhara
Fluorite | Level 6

Thanks! It works! and Thanks for noticing the my carelessness. 

 

Im just wondering if... as i edited on observation 3, cust_name of ABC Agency is keyed in differently as it should be,

how do we do the code so it still read the same record even if the cust_name is different quite a bit?

 

DATA TEMP;
INPUT @1 Ref 7. 
	  @9 ID 5. 
	  @14 LockerNo 12.
	  @28 Date ddmmyy10.
	  @39 Cust_Name $15.
;
format date ddmmyy10.;
DATALINES;
4511445 1952  123456789123 08/12/2001 XYZ PRODUCTION 
4545154 1952  987654321987 26/09/2001 DEF Co.        
7895412 96321 456123789456 12/10/2000 ABC AGE. Y     
7895412 96321 123789654753 21/12/2000 ABC AGENCY     
5451654 96321 125874934589 05/10/2006 ABC AGENCY     
5451654 96321 123467984352 23/11/2004 ABC AGENCY     
5451654 96321 785645464644 17/01/2005 ABC AGENCY     
7895412 96321 123789654753 21/12/2000 PINK Enterprise
;
RUN; 

proc sort data=temp;
by id cust_name descending date;
run;

data want;
set temp;
by id cust_name;
retain final;
if first.cust_name then final = ref;
run;

proc sort data=want;
by id cust_name date;
run;

proc print data=want noobs;
run;

 

Appreciate your help on this!

Kurt_Bremser
Super User

I'd solve that problem first by adding steps to cleanse data.

 

Usually, one does not use names for grouping, but ID values, and keeps names in a separate table that contains IDs and names. This table can be used to check for correct input data when you import data from external sources.

ballardw
Super User

@AieuYuhara wrote:

Thanks! It works! and Thanks for noticing the my carelessness. 

 

Im just wondering if... as i edited on observation 3, cust_name of ABC Agency is keyed in differently as it should be,

how do we do the code so it still read the same record even if the cust_name is different quite a bit?

 

 

Appreciate your help on this!


In many processes one of the first things after reading a new data set is some form of data quality checking. How extensive that process may be depends on data use and potential impact of different types of errors. In some case a field may have a known expected range of values such as branch office names, school grade, car brand or any relatively fixed list that may be checked on reading in a number of ways such as comparison with a known list or format. Or there are known rules for a value such as must be > 0 or within a specified range which can also be relatively easily checked.

 

Open text fields that may contain anything can be a bit harder to validate. Sometimes a simple frequency table will show an unexpected value(s). Then a pass through the data to correct those values may be in order.

 

I do have code in different projects to alert me of new reporting sites (not in an existing format list) as well as providing reports to the generating organizations on data quality elements like pregnant males, changing sex/ race/ date of birth between client visits and such.

 

 

AieuYuhara
Fluorite | Level 6
Yes, this is a data cleaning process and didnt have any table which can be used as referrals; totally a raw files.

I am wondering how data cleaning can be done through coding, like for the example I mentioned on the question posted, for string characters.

I wanted to see if first 5 characters is the same, or last 5 characters is the same, then they are referring to the same Ref.

Do you have any way which I can do this?
Kurt_Bremser
Super User

Data quality improvement is a sophisticated process. SAS provides an extra application (Dataflux) and specialised server instance (DQ Server) for this, with additional licenses.

 

Start with getting knowledge of your data. Run proc freq, and identify possible mistakes.

How you fix it depends wholely on the detected mistakes (uppercase/lowercase, added blanks, numeric/word notation, ...)

 

For names, an approach could be the following:

- set up a table with expected names

data exp_names;
infile cards dlm=',';
input expected_name :$25. actual_name :$25.;
cards;
ABC AGENCY,ABC AGENCY
ABC AGENCY,ABC AGE. Y
;
run;

You can now match input data against actual_name; if a match is found, replace with expected_name; if not, output to a new dataset indicating errors. You then inspect the error dataset and add new lines to exp_names as you see appropriate.

You may be able to create this dataset initially by using rules (first or last 5 matching, or similar) as you stated, but there will be occurences that can't be handled by a simple rule and need your intervention and the application of Brain 1.0.

 

All this is of course the result of poor process design. Any data used for categorization has to be checked on input (eg company names are selected from a drop-down list of companies instead of entered manually); allowing free form input of such values is sub-optimal, to say the least.

Kurt_Bremser
Super User

PS You are not "incredibly new" any longer. Anybody who can write a data step with example data on his/her own has made an important step on the way from beginner to regular SAS user.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2180 views
  • 2 likes
  • 3 in conversation