HI,
i need to extract data that will be 5 characters example TG0001 which is machine # . I'm currently dealing with over 5k machines which have a unique number . I tried compress an it just provided me the # not the letters.
the column something like
column
blah blah TG0001
blah DF0001
blah blah FG0001 blah blah
blah GH 0001
In some cases there might be a space between number an alpha . Not pretty the data is manually input
thank you for your
Sorry, my bad.
NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,CUST_NAME);
double check the variable name (not the label). If it does have blanks embedded like you said, then you will need literal reference:
NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,'CUST NAME'n);
Hard to envision how to possibly tackle this without more concrete examples of text lines you are dealing with. Any chance you can provide a sample, with actual text values (of course, changing anything that might be security sensitive)?
If you don't want keep any non-digits at all, and there is no other digits except this one location, then COMPRESS() will be my first option:
DATA HAVE;
INPUT VAR $ 40.;
NUM=COMPRESS(VAR,,'KD');
CARDS;
blah blah TG0001
blah DF0001
blah blah FG0001 blah blah
blah GH 0001
;
Regards,
Haikuo
Based on the specified pattern this may produce the desired output.
data have;
input col $ 30.;
datalines;
blah blah TG2001
blah DF0001
blah blah FG0001 blah blah
blah GH 0004
;
data want;
length machine_number $8.;
set have;
machine_number=catt(substr(col,find(col,compress(col,,'kd'))-3,3),compress(col,,'kd'));
run;
HI stat@sas your snippet of code work pretty good gave me about 80 % which is acceptable ..the issue that I have run across is that the output is coming in with spaces . I did a compress but it removed the 2 character Id and just gave me the numical. Any thoughts how to remove blanks an keep the output intact? Thanks again
Still the PRX functions will offer a succinct solution, if you want to get rid of the blanks:
DATA HAVE;
INPUT VAR $ 40.;
/* NUM=COMPRESS(VAR,,'KD');*/
NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,VAR);
CARDS;
blah blah TG0001
blah DF0001
blah blah FG0001 blah blah
blah GH 0001
;
You will have to show some sample data that is not working. People here can only provide solutions based on your sample data, we don't know your data, you do.
Good luck,
Haikuo
Hai Kuo
i will be refer a table name info where do I make refer to it in that syntax ? The column name is cust name where do I make that refer ..thanks
DATA want;
set YOURDATA;
NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,VAR);
RUN;
HI,
your data is name of table ...where do I make refer to column the name of column is "CUST NAME"
thanks
Sorry, my bad.
NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,CUST_NAME);
double check the variable name (not the label). If it does have blanks embedded like you said, then you will need literal reference:
NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,'CUST NAME'n);
HEre is the error I'm getting
DATA Info;
822 set info
823 /* NUM=COMPRESS(VAR,,'KD');*/
824 NEWVAR=PRXCHANGE('s/.+([A-Z]{2}) *(\d+).*/$1$2/o',-1,'CUST_NAME'n);
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string, (, -, :, ;,
END, INDSNAME, KEY, KEYS, NOBS, OPEN, POINT, _DATA_, _LAST_, _NULL_.
ERROR 76-322: Syntax error, statement will be ignored.
825
826 ;
827 run;
you are missing a ";' after set info.
Never mind I fig I was missing a ; it works thanks
Hi,
This is based on what you provided. If you can provide some extract from 20% where this is not working will help to understand your problem in a better way.
Thanks,
Hi BETO,
It is difficult to produce a solution that works without knowing the exact structure of the data in which it is contained.
Here is an approach using PERL regular expressions.
DATA HAVE;
INPUT COL $ 30.;
IF _N_ = 1 THEN
DO;
RETAIN PATTERN1;
PATTERN1 = PRXPARSE("/(\w{2}\d{4})|(\w{2}\s\d{4})/");
END;
CALL PRXSUBSTR(PATTERN1,COL,START,LENGTH);
COL1 = COMPRESS(SUBSTR(COL,START,LENGTH));
DATALINES;
BLAH BLAH TG2001
BLAH DF0001
BLAH BLAH FG0001 BLAH BLAH
BLAH GH 0004
;
RUN;
The ("/(\w{2}\d{4})|(\w{2}\s\d{4})/") portion of the code defines the pattern you wish to match. In this case 2 word elements followed by 4 digits OR 2 word elements followed by a space and 4 digits. CALL PRXSUBSTR finds a string that matches the conditions in PATTERN1 and produces 2 variables which we have named START and LENGTH. We then use the values in these 2 variables to extract a string from the original COL variable using the SUBSTR function;
Regards,
Scott
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.