BookmarkSubscribeRSS Feed
Primohunter
Obsidian | Level 7

Hello everyone,
i m trying to create a new column (using case when and contains/like) depending on the context of another string column, but i find it troublesome when it comes to containing numbers such as 50 and 500 for example:

 

HAVE DATASET

CPP

blabla DP50_L

blablablu DP 50 L

ble DP50

blx DP50_500 

blibli DP500_L

bliblibla DP 500 L

bliblible DP500

blublu DP500 50

 

WANT DATASET

CPP                               SPEED

blabla DP50_L                50

blablablu DP 50 L           50

ble DP50                        50

blx DP50_500                50

blibli DP500_L                500

bliblibla DP 500 L           500

bliblible DP500               500

blublu DP500 50             500

 

As a result, i have faulty classification. What is the proper way to create the column of the example above?

Thanks alot for your time and insights!

3 REPLIES 3
LinusH
Tourmaline | Level 20

Data cleansing is a cumbersome job, and sometimes it's hard to get it 100& perfect.

My take would to find starting position of string "DP", substring the following 4 characters, remove any non numerical character, and then input that result to a numerical column.

Data never sleeps
mkeintz
PROC Star

I think there is some ambiguity in your sample WANT.  Namely, what is the rule you intend to apply to get the speed values?  In particular, is it the first number in the text of CPP?  Is it the first number after the string 'DP'?

 

If it's the earlier (first number in the text of CPP):

 

data have;
  input cpp $char25. ;
datalines;
blabla DP50_L
blablablu DP 50 L
ble DP50
blx DP50_500 
blibli DP500_L
bliblibla DP 500 L
bliblible DP500
blublu DP500 50
run;

data want;
  set have;
  *find position of first numeric character,  call it _char1 ;
  _char1=anydigit(cpp);

  *generate substring length until first non-numeric character after _char1, call it _length;
  _length=notdigit(substr(cpp,_char1))-1;

  *Read that substring into a numeric variable;
  speed=input(substr(cpp,_char1,_length),best32.);
run;

 

Edited note:  This works only if the numeric value is a non-negative integer. 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Ksharp
Super User
data have;
input CPP $40.;
cards;
blabla DP50_L
blablablu DP 50 L
ble DP50
blx DP50_500 
blibli DP500_L
bliblibla DP 500 L
bliblible DP500
blublu DP500 50
;

proc sql;
select cpp,case when cpp like '%50%' then 
case when substr(cpp,find(cpp,'50'),3)='500' then 500 else 50 end
           else . end as want 
 from have;
quit;

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

SAS Enterprise Guide vs. SAS Studio

What’s the difference between SAS Enterprise Guide and SAS Studio? How are they similar? Just ask SAS’ Danny Modlin.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 617 views
  • 3 likes
  • 4 in conversation