BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
kevinma28
Fluorite | Level 6

Hello, I'm still relatively new to SAS. I got a dataset which needs to be cleaned. The dataset has several rows shifted to the left, so for example the age is in the name column, etc.

Example:

IDNameAgeSex
1001Marcel20M
.25M.

Desired output:

IDNameAgeSex
1001Marcel20M
..25M

How can I shift the affected rows to the right? Missing values are allowed. Thanks before.

 

Edit:

I attached the Listings.csv dataset that I was given to be cleaned up and analyzed. I noticed some strange values in neighbourhood_group column and after looking at the records, I recognized that those records are shifted. The reason I want to shift these records is to retrieve any valuable data for analysis from these records to get a more accurate result.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

I ran this code after uploading the file to my ODA account:

data want;
infile '~/Listings.csv' dlm="," dsd truncover firstobs=2;
input
  id :$10.
  name :$50.
  host_id :$10.
  host_name :$20.
  neighbourhood_group :$20.
  neighbourhood :$20.
  latitude
  longitude
  room_type :$20.
  price
  minimum_nights
  number_of_reviews
  last_review :mmddyy10.
  reviews_per_month
  calculated_host_listings_count
  availability_365
;
format last_review yymmdd10.;
run;

I found 12 problematic rows.

The offending rows are caused by CR/LF/CRLF characters in the name column. You need to clean those out first before you read the data. Search the communities for "line breaks in text files" or similar; you will find a suitable method for removing those unwanted line terminators. To know what you need to filter out, use an advanced text editor like Notepad++ that can show you the characters.

 

Edit: TERMSTR=CRLF, as suggested by @Ksharp fixes it. So you have single CR or LF characters in a Windows-style terminated text file.

View solution in original post

11 REPLIES 11
PeterClemmensen
Tourmaline | Level 20

How do you know what observations are affected?

AMSAS
SAS Super FREQ

What @PeterClemmensen said, how do you know?
It may seem obvious to us humans (at least this specific example) but not so obvious to the computer
Where's the data coming from? It probably needs correcting at the source, or maybe you have a coding issue when reading it in

ballardw
Super User

@kevinma28 wrote:

Hello, I'm still relatively new to SAS. I got a dataset which needs to be cleaned. The dataset has several rows shifted to the left, so for example the age is in the name column, etc.

Example:

ID Name Age Sex
1001 Marcel 20 M
. 25 M .

Desired output:

ID Name Age Sex
1001 Marcel 20 M
. . 25 M

How can I shift the affected rows to the right? Missing values are allowed. Thanks before.


I am going to say that your desired output should be:

 

ID Name Age Sex
1001 Marcel 20 M
1001 Marcel 25 M

Records without all identification values are just waiting to be a serious problem. Sort on ID? then that Age and Sex no longer have any meaning. Sort on Name, same thing. Attempt to match records to another data set on ID and you have a failure, or incomplete data without the ID values on every record.

 

One suspects the culprit is how the data was read into SAS. Were you responsible for that? If so, show the code used. Best would be to provide an example of the data source file. If you did not read the data, go talk to whoever did read the data as this type of problem indicates a process failure.

 

Having seen such things in the past one suspects that a report type file was read and may have many other missing elements or incorrect data. For instance, if your AGE is supposed to be numeric then you have an issue with this data as it must be Character.

 

kevinma28
Fluorite | Level 6
Thanks for your reply, I've edited my question and attached the data source file.
Kurt_Bremser
Super User

I ran this code after uploading the file to my ODA account:

data want;
infile '~/Listings.csv' dlm="," dsd truncover firstobs=2;
input
  id :$10.
  name :$50.
  host_id :$10.
  host_name :$20.
  neighbourhood_group :$20.
  neighbourhood :$20.
  latitude
  longitude
  room_type :$20.
  price
  minimum_nights
  number_of_reviews
  last_review :mmddyy10.
  reviews_per_month
  calculated_host_listings_count
  availability_365
;
format last_review yymmdd10.;
run;

I found 12 problematic rows.

The offending rows are caused by CR/LF/CRLF characters in the name column. You need to clean those out first before you read the data. Search the communities for "line breaks in text files" or similar; you will find a suitable method for removing those unwanted line terminators. To know what you need to filter out, use an advanced text editor like Notepad++ that can show you the characters.

 

Edit: TERMSTR=CRLF, as suggested by @Ksharp fixes it. So you have single CR or LF characters in a Windows-style terminated text file.

kevinma28
Fluorite | Level 6
Ah, I see. Thank you for the solution.
Patrick
Opal | Level 21

@kevinma28 How did you create the data set with the "shifted" values? Ideally the issue gets resolved as early in the process as possible.

andreas_lds
Jade | Level 19

Looks like something went wrong during importing data, so please post the code you have used to read the file and the log, so that we can help fixing the problem where it occurred.

kevinma28
Fluorite | Level 6

Hello guys, thanks for the replies. I edited my question with more information.

This is what I get if I open the dataset at Excel. I believe there is no problem in my import code as this is a problem in the dataset.

kevinma28_1-1634646519362.png

When I filter it by neighbourhood_group, there should have been only 5 regions (Central, West, East, North, Northeast) in the 5th column. However I noticed that the filter have other values and I realized that it is the value from another column besides it. I need to clean the data so I want to shift the columns so at least some values can still be used for analysis.

Ksharp
Super User

I didn't see any shift record .everything is ok.

 

filename x 'c:\temp\listings.csv' encoding='utf8' termstr=crlf;
proc import datafile=x out=have dbms=csv replace;
getnames=yes;
guessingrows=max;
run;

Ksharp_0-1634647496642.png

 

 

kevinma28
Fluorite | Level 6
Thank you for the solution

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 11 replies
  • 1554 views
  • 1 like
  • 8 in conversation