Hello everyone,
I am new to SAS programming, and working on a personal project to analyze used cars listing on Craigslist between April and May 2021. Here is the dataset that I am using: https://www.kaggle.com/austinreese/craigslist-carstrucks-data
I found a similar blog post on the towards data science website (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regr...) where the author, Jose Portillo, used an older dataset with the same variables. However, Jose coded his project in Python whereas I am doing mine in SAS.
import re manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \ | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \ mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \ | hennessey)' condition = '(excellent | good | fair | like new | salvage | new)' fuel = '(gas | hybrid | diesel |electric)' title_status = '(clean | lien | rebuilt | salvage | missing | parts only)' transmission = '(automatic | manual)' drive = '(4x4 | awd | fwd | rwd | 4wd)' size = '(mid-size | full-size | compact | sub-compact)' type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)' paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)' cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)' keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders'] columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders] for i,column in zip(keys,columns): database[i] = database[i].fillna( database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower() database.drop('description', axis=1, inplace= True)
There are some minor differences such as the drive, type, and num_of_cylinders. However, almost all the variables are the same. If we hold all the variables constant, how can I write the same code in SAS to extract data from the description column to fill in the missing values in the manufacturer, condition, fuel, transmission, and so on columns?
I think I have created the list correctly, however here is what I have done so far to format each column and its observations into a list.
%let manufacturer = (gmc,hyundai,toyota,mitsubishi,ford,chevrolet,ram,buick,jeep,dodge,subaru,nissan,audi,rover,lexus,honda,chrysler,mini,pontiac,mercedes-benz,cadillac,bmw,kia,volvo,volkswagen,jaguar,acura,saturn,mazda,mercury,lincoln,infiniti,ferrari,fiat,tesla,land rover,harley-davidson,datsun,alfa-romeo,morgan,aston-martin,porche,hennessey);
%let condition = (excellent,good,fair,like new,salvage,new);
%let fuel = (gas,hybrid,diesel,electric);
%let title_status = (clean,lien,rebuilt,salvage,missing,parts only);
%let transmission = (automatic,manual);
%let drive = (4wd,fwd,rwd);
%let size = (mid-size,full-size,compact,sub-compact);
%let type = (sedan,truck,SUV,mini-van,wagon,hatchback,coupe,pickup,convertible,van,bus,offroad);
%let paint_color = (red,grey,blue,white,custom,silver,brown,black,purple,green,orange,yellow);
%let cylinders = (s[1-9] cylinders?, s1[0-6]? cylinders?);
Try this:
data want;
set have;
array columns manufacturer condition fuel title_status transmission drive size type paint_color ;
array strings [9] $300 _temporary_ (
/*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda |
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche
| hennessey'
/*condition*/ 'excellent | good | fair | like new | salvage | new'
/*fuel*/ 'gas | hybrid | diesel |electric'
/*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
/*transmission*/ 'automatic | manual'
/*drive*/ '4x4 | awd | fwd | rwd | 4wd'
/*size*/ 'mid-size | full-size | compact | sub-compact'
/*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
/*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
);
length word $50 ;
do col=1 to dim(columns);
do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
word=left(scan(strings[col],index,'|'));
if findw(description,word,,'it') then columns[col]=word;
end;
end;
if missing(cylinders) then do;
index=findw(description,'cylinders',,'it');
if index then do;
cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
end;
end;
drop word col index ;
run;
The columns array is the variables that you want to update (when they are missing). The strings temporary array is the list of words for each variable.
So basically it is finding the first word in the list that it finds in the description.
For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.
I am unsure what you want, but the cylinders variable contains a (corrupt) regular expression rather than a list like the other variables.
I don't know what the python code does.
As for the regex, yes SAS can use them, and the macro variable can contain a regex. It just depends how the macro variable is used.
The other macros variables do not contain a regex, so this one stuck out.
Note that it'sall regular expressions in python, so I'm unsure why you built the SAS values the way you did.
In a nutshell, I am seeking the equivalent SAS code of the following Python code shown below:
import re manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus \ | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \ mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \ | hennessey)' condition = '(excellent | good | fair | like new | salvage | new)' fuel = '(gas | hybrid | diesel |electric)' title_status = '(clean | lien | rebuilt | salvage | missing | parts only)' transmission = '(automatic | manual)' drive = '(4x4 | awd | fwd | rwd | 4wd)' size = '(mid-size | full-size | compact | sub-compact)' type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)' paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)' cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)' keys = ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders'] columns = [ manufacturer, condition, fuel, title_status, transmission ,drive, size, type_, paint_color, cylinders] for i,column in zip(keys,columns): database[i] = database[i].fillna( database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower() database.drop('description', axis=1, inplace= True)
I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.
After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables. How can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?
Yes, I received your response. Thank you for that.
For completeness or anyone else looking to tackle this:
You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.
data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");
do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;
run;
Other ways of solving this include making a list of your lookup values and categories such as:
Category Value
Fuel gas
Fuel diesel
Cylinders 1 Cylinder
Cylinders 2 Cylinder
....
Then take your description column and expand it so that each word has it's own row, see example below.
https://gist.github.com/statgeek/bed5ea2c12903b38fdcf19f3f1f1aae9
Then merge the two via a left join to get your missing values. One benefit of this approach is that you can then use some fuzzy join logic as well which is a bit harder in the direct match approach used above. For example if the post says 1 cyl that won't match to 1 cyl but you could do partial searches. I realized there's actually no regex in your original program except for the cylinders specification and I'm not 100% sure what the fillNA is doing with the matches.
It would probably help if you posted a few lines of the data and the output you get from python.
Try this:
data want;
set have;
array columns manufacturer condition fuel title_status transmission drive size type paint_color ;
array strings [9] $300 _temporary_ (
/*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda |
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche
| hennessey'
/*condition*/ 'excellent | good | fair | like new | salvage | new'
/*fuel*/ 'gas | hybrid | diesel |electric'
/*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
/*transmission*/ 'automatic | manual'
/*drive*/ '4x4 | awd | fwd | rwd | 4wd'
/*size*/ 'mid-size | full-size | compact | sub-compact'
/*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
/*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
);
length word $50 ;
do col=1 to dim(columns);
do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
word=left(scan(strings[col],index,'|'));
if findw(description,word,,'it') then columns[col]=word;
end;
end;
if missing(cylinders) then do;
index=findw(description,'cylinders',,'it');
if index then do;
cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
end;
end;
drop word col index ;
run;
The columns array is the variables that you want to update (when they are missing). The strings temporary array is the list of words for each variable.
So basically it is finding the first word in the list that it finds in the description.
For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.
Nice! This is what I was looking for 🙂 I will try the code and let you know how it goes.
Unfortunately, the code did not work. Do you know what the error messages mean?
3004 DATA want;
3005 SET have;
3006
3007 array columns manufacturer condition fuel title_status transmission drive size type paint_color;
3008 array strings [9] $300 _temporary_ (
3009 /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover | lexus
3010 | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda |
3011 mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche
NOTE: The quoted string currently being processed has become more than 262 characters long. You might have unbalanced quotation marks.
3012 | hennessey'
3013 /*condition*/ 'excellent | good | fair | like new | salvage | new'
3014 /*fuel*/ 'gas | hybrid | diesel |electric'
3015 /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
3016 /*transmission*/ 'automatic | manual'
3017 /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
3018 /*size*/ 'mid-size | full-size | compact | sub-compact'
3019 /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
3020 /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
3021 );
3022
3023 length word $50;
3024 do col=1 to dim(columns);
3025 do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
-
388
200
76
ERROR 388-185: Expecting an arithmetic operator.
ERROR 200-322: The symbol is not recognized and will be ignored.
ERROR 76-322: Syntax error, statement will be ignored.
3026 word=left(scan(strings[col],index,'|'));
3027 if findw(description,word,,'it') then columns[col]=word;
3028 end;
3029 end;
3030
3031 if missing(cylinders) then do;
3032 index=findw(description,'cylinders',,'it');
3033 if index then do;
3034 cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
3035 end;
3036 end;
3037
3038 drop word col index ;
3039 run;
There is an extra closing bracket that shouldn't be there. Try this:
do index=1 to countw(strings[col],'|') while (missing(columns[col])) ;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.