Hello, I'm still relatively new to SAS. I got a dataset which needs to be cleaned. The dataset has several rows shifted to the left, so for example the age is in the name column, etc.
Example:
ID | Name | Age | Sex |
1001 | Marcel | 20 | M |
. | 25 | M | . |
Desired output:
ID | Name | Age | Sex |
1001 | Marcel | 20 | M |
. | . | 25 | M |
How can I shift the affected rows to the right? Missing values are allowed. Thanks before.
Edit:
I attached the Listings.csv dataset that I was given to be cleaned up and analyzed. I noticed some strange values in neighbourhood_group column and after looking at the records, I recognized that those records are shifted. The reason I want to shift these records is to retrieve any valuable data for analysis from these records to get a more accurate result.
I ran this code after uploading the file to my ODA account:
data want;
infile '~/Listings.csv' dlm="," dsd truncover firstobs=2;
input
id :$10.
name :$50.
host_id :$10.
host_name :$20.
neighbourhood_group :$20.
neighbourhood :$20.
latitude
longitude
room_type :$20.
price
minimum_nights
number_of_reviews
last_review :mmddyy10.
reviews_per_month
calculated_host_listings_count
availability_365
;
format last_review yymmdd10.;
run;
I found 12 problematic rows.
The offending rows are caused by CR/LF/CRLF characters in the name column. You need to clean those out first before you read the data. Search the communities for "line breaks in text files" or similar; you will find a suitable method for removing those unwanted line terminators. To know what you need to filter out, use an advanced text editor like Notepad++ that can show you the characters.
Edit: TERMSTR=CRLF, as suggested by @Ksharp fixes it. So you have single CR or LF characters in a Windows-style terminated text file.
How do you know what observations are affected?
What @PeterClemmensen said, how do you know?
It may seem obvious to us humans (at least this specific example) but not so obvious to the computer
Where's the data coming from? It probably needs correcting at the source, or maybe you have a coding issue when reading it in
@kevinma28 wrote:
Hello, I'm still relatively new to SAS. I got a dataset which needs to be cleaned. The dataset has several rows shifted to the left, so for example the age is in the name column, etc.
Example:
ID Name Age Sex 1001 Marcel 20 M . 25 M . Desired output:
ID Name Age Sex 1001 Marcel 20 M . . 25 M How can I shift the affected rows to the right? Missing values are allowed. Thanks before.
I am going to say that your desired output should be:
ID | Name | Age | Sex |
1001 | Marcel | 20 | M |
1001 | Marcel | 25 | M |
Records without all identification values are just waiting to be a serious problem. Sort on ID? then that Age and Sex no longer have any meaning. Sort on Name, same thing. Attempt to match records to another data set on ID and you have a failure, or incomplete data without the ID values on every record.
One suspects the culprit is how the data was read into SAS. Were you responsible for that? If so, show the code used. Best would be to provide an example of the data source file. If you did not read the data, go talk to whoever did read the data as this type of problem indicates a process failure.
Having seen such things in the past one suspects that a report type file was read and may have many other missing elements or incorrect data. For instance, if your AGE is supposed to be numeric then you have an issue with this data as it must be Character.
I ran this code after uploading the file to my ODA account:
data want;
infile '~/Listings.csv' dlm="," dsd truncover firstobs=2;
input
id :$10.
name :$50.
host_id :$10.
host_name :$20.
neighbourhood_group :$20.
neighbourhood :$20.
latitude
longitude
room_type :$20.
price
minimum_nights
number_of_reviews
last_review :mmddyy10.
reviews_per_month
calculated_host_listings_count
availability_365
;
format last_review yymmdd10.;
run;
I found 12 problematic rows.
The offending rows are caused by CR/LF/CRLF characters in the name column. You need to clean those out first before you read the data. Search the communities for "line breaks in text files" or similar; you will find a suitable method for removing those unwanted line terminators. To know what you need to filter out, use an advanced text editor like Notepad++ that can show you the characters.
Edit: TERMSTR=CRLF, as suggested by @Ksharp fixes it. So you have single CR or LF characters in a Windows-style terminated text file.
@kevinma28 How did you create the data set with the "shifted" values? Ideally the issue gets resolved as early in the process as possible.
Looks like something went wrong during importing data, so please post the code you have used to read the file and the log, so that we can help fixing the problem where it occurred.
Hello guys, thanks for the replies. I edited my question with more information.
This is what I get if I open the dataset at Excel. I believe there is no problem in my import code as this is a problem in the dataset.
When I filter it by neighbourhood_group, there should have been only 5 regions (Central, West, East, North, Northeast) in the 5th column. However I noticed that the filter have other values and I realized that it is the value from another column besides it. I need to clean the data so I want to shift the columns so at least some values can still be used for analysis.
I didn't see any shift record .everything is ok.
filename x 'c:\temp\listings.csv' encoding='utf8' termstr=crlf; proc import datafile=x out=have dbms=csv replace; getnames=yes; guessingrows=max; run;
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.