trying to make my ETL code more succint for my SAS explore paper.
I want to pull up all records that have the strings "$" and "million" in them. Here's the code that I'm currently using. Wondering if anyone has a chance to look at it & provide more succinct code. maybe 4-5 lines?
any help much appreciated.
*create table;
data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;
*find string;
data TestCount;
length MillionText $45;
set testcount;
*where find(Notes,'million')>0;
Notes=put(Notes,$hex1000.);
Notes=tranwrd(Notes,'C2A0','20');
Notes=input(Notes, $hex1000.);
MillionCount = 0;
NumWords=countw(Notes);
do i= 1 to NumWords;
word = scan(Notes, i, ' ');
if compress(word,',') = 'million' then MillionCount=i;
end;
if MillionCount ne 0 then MillionText=compress(catx(' ', scan(notes,MillionCount-1,' ') , scan(notes,MillionCount,' ') ), ',');
drop date Location 'Name or event(s)'n city countryname;
run;
Hi @sqlGoddess
The following code extracts the same as your code from your test data, but it doesn't exactly match your specification. To make it work like your code, I made it scan for € as well as $. This is easy to correct, just replace [$|€] with $.
data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;
data testcount2;
length MillionText $45;
set testcount;
Notes=tranwrd(Notes,'C2A0'x,'20'x);
MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
run;
Normally when using a hex value the quoted text is followed by an x. 'A0'x for example.
It may help us if you showed how you read the source into SAS. That might be the simplest place to fix things.
To truly represent that source you may have to attach a file. When I copy your data step and run the code there aren't any of your problem characters. Quite likely because a data step that I run will use my language settings and so none of the conversion between UTF and WLatin are involved.
You will have to show exactly what you did in "Using the ‘A0’ specification for hex, made the the entire data set unreadable. ". I have a hard time seeing how such could happen unless possibly used as a delimiter for an Infile statement when not actually present.
Forgot I could attach a file, here is the excel file that I read in with the libname statement
%let path=
libname xl xlsx "&path\cdata.xlsx";
Any character that is not a normal 7-bit ASCII code will require 2 or more bytes in UTF-8.
Using TRANWRD(), which does not understand multi-byte characters), can lead to changing just some of a multi-byte character and result in invalid UTF-8 codes.
Use KTRANSLATE() to convert characters in UTF-8 strings. So to replace non-breaking spaces with normal spaces use:
string = ktranslate(string,' ','C2A0'x);
Hi @sqlGoddess
The following code extracts the same as your code from your test data, but it doesn't exactly match your specification. To make it work like your code, I made it scan for € as well as $. This is easy to correct, just replace [$|€] with $.
data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;
data testcount2;
length MillionText $45;
set testcount;
Notes=tranwrd(Notes,'C2A0'x,'20'x);
MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
run;
Just what I was hoping to get, some dense PERL Code. thank you. @ErikLund_Jensen
next question, How Would I locate just those 3 rows of data that have the string million in them.
my original thought was to craft a where clause in proc print against the original data, but I would love to see if prxchange can do more like a boolean expression, & return only those rows with a match..
@sqlGoddess wrote:
next question, How Would I locate just those 3 rows of data that have the string million in them.
my original thought was to craft a where clause in proc print against the original data, but I would love to see if prxchange can do more like a boolean expression, & return only those rows with a match..
If you are looking for a relatively infrequent occurrence and already using a data step to add something like your MillionText variable, check if it is not missing and when present write to another data set.
data testcount2 Millions; length MillionText $45; set testcount; Notes=tranwrd(Notes,'C2A0'x,'20'x); MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes); if upcase(scan(milliontext,2,' '))='MILLION' then output millions; output testcount2; run;
Though perhaps @ErikLund_Jensen can come up with a PRX that only assigns milliontext when million is present. I added some dummy note values and it copied their values to milliontext.
Hi @sqlGoddess
Here are some different ways. My primitive prxchange extracts MillionText from Notes if a string containing corrency + number + 'million' is present, otherwise it returns the full Notes-text, so they are all based on a comparison between the extractes text and the full note. I think it could be done more elegant.
* Test data - row 3 doesn't contain million;
data testcount;
infile datalines delimiter='|';
input Notes: $526. ;
datalines;
In 2009, Cirque du Soleil was given a $34.4 million, five-year contract by the city of Quebec to create free summer productions for tourists. The show, titled Les Chemins invisibles, premiered a different installment each year for five years between 2009 and 201"
Cirque du Soleil created a $3.3 million production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Cirque du Soleil created an expensive production involving 150 artists to commemorate the 400th anniversary of Quebec City in October 2008, running for a total of 5 performances.[176][177]"
Allavita! was an hour-long show featuring 48 artists created for Expo 2015 in Milan. The €8 million production was performed in an open-air theatre built by Expo 2015.[185]
;
run;
* Extract notes containing 'million', generate Milliontext;
data want1;
length MillionText $45;
set testcount;
Notes=tranwrd(Notes,'C2A0'x,'20'x);
MillionText = prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes);
if MillionText not =: Notes then output;
run;
* Extract notes containing 'million', generate Milliontext;
proc sql;
create table want2 as
select
Notes,
prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) as MillionText
from testcount
where calculated MillionText ne Notes;
quit;
* Extract notes containing 'million';
proc sql;
create table want3 as
select Notes
from TestCount
where prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) ne Notes;
quit;
* Extract notes containing 'million';
data want4;
set testcount(where=(prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) not =: Notes));
run;
* Print notes containing 'million';
proc print data=testcount(where=(prxchange('s/(.*)([$|€]\s*\d*\.*\d*\s+million)(.*)/$2/i',-1,Notes) not =: Notes));
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.