BookmarkSubscribeRSS Feed
goyalrk
Calcite | Level 5

Hello, here is my first post in the community hoping to get some help. I have a large SAS dataset with a character field that i'd like to clean up by using a set of 2 delimiters.

 

for instance, my data has a string variable with observations like:

jkdfldfd-abc--123

dkjfds789-cdegfd--abc1

asddkj--dmco-hwrfd

....

....

 

I am trying to find a way to create a new variable that cuts off at the double '--' dash delimiter and not at the single '-' dash. I read somewhere that Scan function does not support multiple delimiters and some have suggested alternatives using infile statement. The infile approach is not efficient in my case since the data is already in SAS format and as I said it's a large dataset with many fields.

 

Is there another alternative that you can suggest?

 

Many thanks in advance for your help!

7 REPLIES 7
Tom
Super User Tom
Super User

Provide more details about what you want out.  If you just want to location the position of a string like '--' then use the INDEX() function.

 

var1='jkdfldfd-abc--123';
loc=index(var1,'--');
if loc then var2=substr(var1,1,loc-1);

 

116  data _null_;
117    length var1 var1 $100;
118    var1='jkdfldfd-abc--123';
119    loc=index(var1,'--');
120    if loc then var2=substr(var1,1,loc-1);
121    put (_all_) (=);
122  run;

var1=jkdfldfd-abc--123 loc=13 var2=jkdfldfd-abc
goyalrk
Calcite | Level 5

Thanks for your response. My goal here is to identify all the characters up until there are no two or more consecutive delimiters. 

So, for instance, from "adfrgk-dsfgdg--dfgfh" I want to extract "adfrgk-dsfgdg"; from "oasdjfiowd-adhf-kladhf----zdfdsfg" extract "oasdjfiowd-adhf-kladhf" into a separate variable.

Tom
Super User Tom
Super User

@goyalrk wrote:

Thanks for your response. My goal here is to identify all the characters up until there are no two or more consecutive delimiters. 

So, for instance, from "adfrgk-dsfgdg--dfgfh" I want to extract "adfrgk-dsfgdg"; from "oasdjfiowd-adhf-kladhf----zdfdsfg" extract "oasdjfiowd-adhf-kladhf" into a separate variable.


So that is what I posted. 

 

What do you want to happen when the source string does NOT contain the delimiter string?

The code I posted does nothing, but you could add an ELSE clause to implement whatever rule you want in that case.

 

If you want the whole string then you could avoid the extra variable and IF/THEN logic by appending '--' so that you are sure you will always find at least one delimiter string. 

newvar=substr(oldvar||'--',1,index(oldvar||'--','--')-1);
hashman
Ammonite | Level 13

@goyalrk:

There are many ways of doing this. One would be:

data have ;                                     
  input str $char22. ;                          
  cards ;                                       
jkdfldfd-abc--123                               
dkjfds789-cdegfd--abc1                          
asddkj--dmco-hwrfd                              
run ;                                           
                                                
data want ;                                     
  set have ;                                    
  str1 = substr (str, 1, find (str, "--") - 1) ;
  str2 = substr (str, find (str, "--") + 2) ;   
run ;                                           

Using SUBSTR means that STR1 and STR2 will assume the same system length as STR.

Another is to translate "--" into a single character other than a dash and use the SCAN function:

data want ;                                 
  set have ;                                
  str1 = str ;                              
  str2 = str ;                              
  str1 = scan (tranwrd (str, "--", ""), 1) ;
  str2 = scan (tranwrd (str, "--", ""), 2) ;
run ;                                       

Here STR1 and STR2 are first assigned STR to make them assume its system length; otherwise SCAN would make both $200 by default.

 

Kind regards

Paul D.

 

goyalrk
Calcite | Level 5

Thanks @hashman! For some reasons, both the alternatives below are returning an error: "invalid third argument". Am I missing something?

hashman
Ammonite | Level 13

@goyalrk:

The only reason I can see is that in your testing, you have a string with no more than 1 dash in a row anywhere. In this case, the expression:

 

find (str, "--")

 

will return 0, and SUBSTR cannot have it as an argument; hence the error. One way to guard against it - and return the full original string if it has no double dashes anywhere is to use the expression for the third SUBSTR argument:

 

ifn (find (str, "--"), find (str, "--") - 1, length (str))

 

Another way is to make "--" what is termed a sentinel by attaching it to the end of STR: 

 

substr (str||"--", 1, find (str||"--", "--") - 1)

 

as it guarantees that "--" is always found. In other words (including a string without double dashes in the test in the last record):

data have ;                                                                           
  input str $char22. ;                                                                
  cards ;                                                                             
jkdfldfd-abc----123                                                                   
dkjfds789-cdegfd---abc1                                                               
asddkj--dmco-hwrfd                                                                    
xyzz-abcd-456                                                                         
run ;                                                                                 
                                                                                      
data want ;                                                                           
  set have ;                                                                          
  str1 = substr (str, 1, ifn (find (str, "--"), find (str, "--") - 1, length (str))) ;
* str1 = substr (str||"--", 1, find (str||"--", "--") - 1) ; *same results;            
run ;                                                                                 

Kind regards

Paul D.

 

 



 

ballardw
Super User

@goyalrk wrote:

Thanks @hashman! For some reasons, both the alternatives below are returning an error: "invalid third argument". Am I missing something?


Post the code and comments from the log. Paste into a code box opened using the forum's {I} or "running man" icon.

We can't see what actual code you submit without log. And the messages often include diagnostics that you don't mention.

 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 4522 views
  • 0 likes
  • 4 in conversation