Solved: Use values from an existing column to fill missing values in other col...

Vicente95 · Posted 06-30-2021 12:38 AM

Hello everyone,

I am new to SAS programming, and working on a personal project to analyze used cars listing on Craigslist between April and May 2021. Here is the dataset that I am using: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

I found a similar blog post on the towards data science website (https://towardsdatascience.com/end-to-end-data-science-project-predicting-used-car-prices-using-regr...) where the author, Jose Portillo, used an older dataset with the same variables. However, Jose coded his project in Python whereas I am doing mine in SAS.

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

There are some minor differences such as the drive, type, and num_of_cylinders. However, almost all the variables are the same. If we hold all the variables constant, how can I write the same code in SAS to extract data from the description column to fill in the missing values in the manufacturer, condition, fuel, transmission, and so on columns?

I think I have created the list correctly, however here is what I have done so far to format each column and its observations into a list.

%let manufacturer = (gmc,hyundai,toyota,mitsubishi,ford,chevrolet,ram,buick,jeep,dodge,subaru,nissan,audi,rover,lexus,honda,chrysler,mini,pontiac,mercedes-benz,cadillac,bmw,kia,volvo,volkswagen,jaguar,acura,saturn,mazda,mercury,lincoln,infiniti,ferrari,fiat,tesla,land rover,harley-davidson,datsun,alfa-romeo,morgan,aston-martin,porche,hennessey);

%let condition = (excellent,good,fair,like new,salvage,new);

%let fuel = (gas,hybrid,diesel,electric);

%let title_status = (clean,lien,rebuilt,salvage,missing,parts only);

%let transmission = (automatic,manual);

%let drive = (4wd,fwd,rwd);

%let size = (mid-size,full-size,compact,sub-compact);

%let type = (sedan,truck,SUV,mini-van,wagon,hatchback,coupe,pickup,convertible,van,bus,offroad);

%let paint_color = (red,grey,blue,white,custom,silver,brown,black,purple,green,orange,yellow);

%let cylinders = (s[1-9] cylinders?, s1[0-6]? cylinders?);

Tom · Posted 06-30-2021 01:29 PM

Try this:

data want;
  set have;

  array columns  manufacturer condition fuel title_status transmission drive size type paint_color  ;
  array strings [9] $300 _temporary_ (
    /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus 
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | 
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche 
    | hennessey'
    /*condition*/ 'excellent | good | fair | like new | salvage | new'
    /*fuel*/ 'gas | hybrid | diesel |electric'
    /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
    /*transmission*/ 'automatic | manual'
    /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
    /*size*/ 'mid-size | full-size | compact | sub-compact'
    /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
    /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
  );
  length word $50 ;
  do col=1 to dim(columns);
    do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
      word=left(scan(strings[col],index,'|'));
      if findw(description,word,,'it') then columns[col]=word;
    end;
  end;

  if missing(cylinders) then do;
    index=findw(description,'cylinders',,'it');
    if index then do;
      cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
    end;
  end;

  drop word col index ;
run;

The columns array is the variables that you want to update (when they are missing). The strings temporary array is the list of words for each variable.

So basically it is finding the first word in the list that it finds in the description.

For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.

View solution in original post

ChrisNZ · Posted 06-30-2021 01:47 AM

I am unsure what you want, but the cylinders variable contains a (corrupt) regular expression rather than a list like the other variables.

High-Performance SAS Coding - Third Edition

Vicente95 · Posted 06-30-2021 02:02 AM

So, I want the equivalent SAS code of what is contained in the screenshot since the author wrote it in Python. What would the SAS code resemble?

As for the cylinders list, it would be:

%let cylinders=(3 cylinders,4 cylinders,5 cylinders,6 cylinders,8 cylinders,10 cylinders,12 cylinders);

Is there a regex to shorten the cylinders list in SAS?

ChrisNZ · Posted 06-30-2021 02:11 AM

I don't know what the python code does.

As for the regex, yes SAS can use them, and the macro variable can contain a regex. It just depends how the macro variable is used.

The other macros variables do not contain a regex, so this one stuck out.

Note that it'sall regular expressions in python, so I'm unsure why you built the SAS values the way you did.

High-Performance SAS Coding - Third Edition

Reeza · Posted 06-30-2021 10:38 AM

A lot of what that code is doing is not required in SAS though....you've truncated the code in the picture. I took a quick look at the data and there's no description field in the text so you need to clarify what you're asking for help with.
Also, since the data is CSV you control the the type and format when importing the data so you can ensure the data is read in correctly.

Vicente95 · Posted 06-30-2021 10:58 AM

In a nutshell, I am seeking the equivalent SAS code of the following Python code shown below:

import re

manufacturer = '(gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus \
| honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | \
mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche \
| hennessey)'
condition = '(excellent | good | fair | like new | salvage | new)'
fuel = '(gas | hybrid | diesel |electric)'
title_status = '(clean | lien | rebuilt | salvage | missing | parts only)'
transmission = '(automatic | manual)'
drive = '(4x4 | awd | fwd | rwd | 4wd)'
size = '(mid-size | full-size | compact | sub-compact)'
type_ = '(sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad)'
paint_color = '(red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow)'
cylinders = '(\s[1-9] cylinders? |\s1[0-6]? cylinders?)'

keys =    ['manufacturer', 'condition', 'fuel', 'title_status', 'transmission', 'drive','size', 'type', 'paint_color' , 'cylinders']
columns = [ manufacturer,   condition,   fuel,  title_status, transmission ,drive, size, type_, paint_color,   cylinders]

for i,column in zip(keys,columns):
    database[i] = database[i].fillna(
      database['description'].str.extract(column, flags=re.IGNORECASE, expand=False)).str.lower()

database.drop('description', axis=1, inplace= True)

I have a large dataset in SAS that has 17 variables of which four are numeric and 13 character/string. The original dataset that I am using can be found here: https://www.kaggle.com/austinreese/craigslist-carstrucks-data.

cylinders
condition
drive
paint_color
type
manufacturer
title_status
model
fuel
transmission
description
region
state
price (num)
posting_date (num)
odometer (num)
year (num)

After applying specific filters to the numeric columns, there are no missing values for each numeric variable. However, there are thousands to hundreds of thousands of missing variables for the remaining 14 char/string variables. How can I write the equivalent SAS code where I use regex on the description column to fill missing values of the other string/char columns with categorical values such as cylinders, condition, drive, paint_color, and so on?

Reeza · Posted 06-30-2021 11:09 AM

I posted an example in your StackOverflow question - yeah, Kaggle only shows the first few columns by default apparently. But it's just doing a basic text search on the description field so you can just replace it with FIND() or INDEX() and some loops. REGEX is also an option but I hate writing regex so someone else will have to help you with that.

Vicente95 · Posted 06-30-2021 11:30 AM

Yes, I received your response. Thank you for that.

Reeza · Posted 06-30-2021 12:01 PM

For completeness or anyone else looking to tackle this:

You can expand this by creating an array for each variable and then looping through your lists. I think you can replace the loop with a REGEX command as well in SAS but regex requires too much thinking so someone else will have to provide that answer.

data want;
set have;
array _fuel(*) $ _temporary_ ("gas", "hybrid", "diesel", "electric");

do i=1 to dim(_fuel);
if find(description, _fuel(i), 'it')>0 then fuel = _fuel(i);
*does not deal with multiple finds so the last one found will be kept;
end;

run;

Other ways of solving this include making a list of your lookup values and categories such as:

Category Value
Fuel gas
Fuel diesel
Cylinders 1 Cylinder
Cylinders 2 Cylinder
....

Then take your description column and expand it so that each word has it's own row, see example below.

https://gist.github.com/statgeek/bed5ea2c12903b38fdcf19f3f1f1aae9

Then merge the two via a left join to get your missing values. One benefit of this approach is that you can then use some fuzzy join logic as well which is a bit harder in the direct match approach used above. For example if the post says 1 cyl that won't match to 1 cyl but you could do partial searches. I realized there's actually no regex in your original program except for the cylinders specification and I'm not 100% sure what the fillNA is doing with the matches.

It would probably help if you posted a few lines of the data and the output you get from python.

Tom · Posted 06-30-2021 01:29 PM

Try this:

data want;
  set have;

  array columns  manufacturer condition fuel title_status transmission drive size type paint_color  ;
  array strings [9] $300 _temporary_ (
    /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus 
    | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda | 
    mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche 
    | hennessey'
    /*condition*/ 'excellent | good | fair | like new | salvage | new'
    /*fuel*/ 'gas | hybrid | diesel |electric'
    /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
    /*transmission*/ 'automatic | manual'
    /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
    /*size*/ 'mid-size | full-size | compact | sub-compact'
    /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
    /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
  );
  length word $50 ;
  do col=1 to dim(columns);
    do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
      word=left(scan(strings[col],index,'|'));
      if findw(description,word,,'it') then columns[col]=word;
    end;
  end;

  if missing(cylinders) then do;
    index=findw(description,'cylinders',,'it');
    if index then do;
      cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
    end;
  end;

  drop word col index ;
run;

The columns array is the variables that you want to update (when they are missing). The strings temporary array is the list of words for each variable.

So basically it is finding the first word in the list that it finds in the description.

For cylinders I used a separate step to find where the word cylinders appears and then prefix the last word that appears before that.

Vicente95 · Posted 06-30-2021 01:43 PM

Nice! This is what I was looking for 🙂 I will try the code and let you know how it goes.

Vicente95 · Posted 06-30-2021 06:43 PM

Unfortunately, the code did not work. Do you know what the error messages mean?

3004  DATA want;
3005      SET have;
3006
3007      array columns  manufacturer condition fuel title_status transmission drive size type paint_color;
3008      array strings [9] $300 _temporary_ (
3009      /*manufacturer*/ 'gmc | hyundai | toyota | mitsubishi | ford | chevrolet | ram | buick | jeep | dodge | subaru | nissan | audi | rover  | lexus
3010      | honda | chrysler | mini | pontiac | mercedes-benz | cadillac | bmw | kia | volvo | volkswagen | jaguar | acura | saturn | mazda |
3011      mercury | lincoln | infiniti | ferrari | fiat | tesla | land rover | harley-davidson | datsun | alfa-romeo | morgan | aston-martin | porche
NOTE: The quoted string currently being processed has become more than 262 characters long.  You might have unbalanced quotation marks.
3012      | hennessey'
3013      /*condition*/ 'excellent | good | fair | like new | salvage | new'
3014      /*fuel*/ 'gas | hybrid | diesel |electric'
3015      /*title_status*/ 'clean | lien | rebuilt | salvage | missing | parts only'
3016      /*transmission*/ 'automatic | manual'
3017      /*drive*/ '4x4 | awd | fwd | rwd | 4wd'
3018      /*size*/ 'mid-size | full-size | compact | sub-compact'
3019      /*type*/ 'sedan | truck | SUV | mini-van | wagon | hatchback | coupe | pickup | convertible | van | bus | offroad'
3020      /*paint_color*/ 'red | grey | blue | white | custom | silver | brown | black | purple | green | orange | yellow'
3021    );
3022
3023    length word $50;
3024    do col=1 to dim(columns);
3025      do index=1 to countw(strings[col]),'|') while (missing(columns[col])) ;
                                                -
                                                388
                                                200
                                                76
ERROR 388-185: Expecting an arithmetic operator.

ERROR 200-322: The symbol is not recognized and will be ignored.

ERROR 76-322: Syntax error, statement will be ignored.

3026        word=left(scan(strings[col],index,'|'));
3027        if findw(description,word,,'it') then columns[col]=word;
3028      end;
3029    end;
3030
3031    if missing(cylinders) then do;
3032      index=findw(description,'cylinders',,'it');
3033      if index then do;
3034        cylinders=scan(substrn(description,1,index-1),-1)||' cylinders';
3035      end;
3036    end;
3037
3038    drop word col index ;
3039  run;

SASKiwi · Posted 06-30-2021 07:02 PM

There is an extra closing bracket that shouldn't be there. Try this:

do index=1 to countw(strings[col],'|') while (missing(columns[col])) ;

Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Re: Use values from an existing column to fill missing values in other columns

Classroom Training Available!