BookmarkSubscribeRSS Feed
pingsu
Calcite | Level 5

My data set is very messy such as the data starts from 8 row, lots space , lots comas, 55 variables and 7 lines for a title.

Before I  use import the data , do I have to clean it? or  I import it first and then clean the data in SAS.

 

and I have lots errors.

 

proc import datafile ='/home/inkyung/repos/lasso/data/0001.csv'
            out=tax00
            dbms=csv replace;
            delimiter=',';
            getnames=yes;
            guessingrows=1000;
run;
proc print data=tax00;
run;




 

5 REPLIES 5
paulkaefer
Lapis Lazuli | Level 10

This depends on your data and how much you want SAS to do. For one, you can use DSD and MISSOVER if you use a DATA step. The DSD option will help if you have several commas in a row that signify missing values. This paper is a good intro to some options.

 

How does the code you have now work? If you still need help, can you post a small example of 0001.csv, including some of the "worst" examples you are seeing? Maybe including what looks wrong in the output would help.

pingsu
Calcite | Level 5

It doesn't work.

 

The data set is as follows:

 

,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Table 3: STATE TAX COLLECTIONS BY STATE AND TYPE OF TAX,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,(thousands of dollars),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Year: 2000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,"Quarter: 1 (January, February, March)",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Revision 1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,U.S. Total,Alabama,Alaska,Arizona,Arkansas,California,Colorado,Connecticut,Delaware,Florida,Georgia,Hawaii,Idaho,Illinois,Indiana,Iowa,Kansas,Kentucky,Louisiana,Maine,Maryland,Massachusetts,Michigan,Minnesota,Mississippi,Missouri,Montana,Nebraska,Nevada,New Hampshire,New Jersey,New Mexico,New York,North Carolina,North Dakota,Ohio,Oklahoma,Oregon,Pennsylvania,Rhode Island,South Carolina,South Dakota,Tennessee,Texas,Utah,Vermont,Virginia,Washington,West Virginia,Wisconsin,Wyoming,Wash. DC
,,,Excludes D.C.,,,Revised,Revised,,,Revised,Revised,Revised,,,Revised,Revised,Revised,,,Revised,Revised,,,,,,,Revised,Est.,,Revised,,,Revised,,,,,Revised,,,,,Revised,,,,Revised,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,Total Taxes,,133987401,1554648,204757,1922171,1135907,20206217,1614872,2219791,533972,6664802,3295666,860653,539018,6142447,2492170,1363449,1104084,1726561,1494620,583932,2267078,4055999,6827675,3719413,930371,1879749,317540,722445,626221,287934,4029776,831934,11762936,3695397,274030,5757836,1354611,1307216,6049117,508208,1325561,190918,1925580,6503198,867827,341961,2786879,3131704,765184,3069464,213902,628645

paulkaefer
Lapis Lazuli | Level 10

Thanks for providing the sample. Will you be reading in many of these, or just the one?

 

You can certainly do this with a data step, though it will probably be best to read in line-by-line. You can use the scan function, for example, to parse the input strings. The DATA step will allow you to output to different data sets, so you could output state names to a temporary set, the numeric values to another, and append them. This is difficult because your data doesn't follow typical CSV patterns.

 

However, it may be easier to reformat the set, especially if you are only reading in the one set. I copied and pasted the data you posted into Excel. I went to Data >> Text to Columns and selected the options of delimeter, comma. Then I formatted it like this:

State	Note	Total Taxes	Year	Quarter	Revision
U.S. Total	Excludes D.C.	133987401	2000	1	1
Alabama		1554648	2000	1	1
Alaska		204757	2000	1	1
Arizona	Revised	1922171	2000	1	1
Arkansas	Revised	1135907	2000	1	1
California		20206217	2000	1	1
Colorado		1614872	2000	1	1
Connecticut	Revised	2219791	2000	1	1
Delaware	Revised	533972	2000	1	1
Florida	Revised	6664802	2000	1	1
Georgia		3295666	2000	1	1
Hawaii		860653	2000	1	1
Idaho	Revised	539018	2000	1	1
Illinois	Revised	6142447	2000	1	1
Indiana	Revised	2492170	2000	1	1
Iowa		1363449	2000	1	1
Kansas		1104084	2000	1	1
Kentucky	Revised	1726561	2000	1	1
Louisiana	Revised	1494620	2000	1	1
Maine		583932	2000	1	1
Maryland		2267078	2000	1	1
Massachusetts		4055999	2000	1	1
Michigan		6827675	2000	1	1
Minnesota		3719413	2000	1	1
Mississippi		930371	2000	1	1
Missouri	Revised	1879749	2000	1	1
Montana	Est.	317540	2000	1	1
Nebraska		722445	2000	1	1
Nevada	Revised	626221	2000	1	1
New Hampshire		287934	2000	1	1
New Jersey		4029776	2000	1	1
New Mexico	Revised	831934	2000	1	1
New York		11762936	2000	1	1
North Carolina		3695397	2000	1	1
North Dakota		274030	2000	1	1
Ohio		5757836	2000	1	1
Oklahoma	Revised	1354611	2000	1	1
Oregon		1307216	2000	1	1
Pennsylvania		6049117	2000	1	1
Rhode Island		508208	2000	1	1
South Carolina		1325561	2000	1	1
South Dakota	Revised	190918	2000	1	1
Tennessee		1925580	2000	1	1
Texas		6503198	2000	1	1
Utah		867827	2000	1	1
Vermont	Revised	341961	2000	1	1
Virginia		2786879	2000	1	1
Washington		3131704	2000	1	1
West Virginia		765184	2000	1	1
Wisconsin		3069464	2000	1	1
Wyoming		213902	2000	1	1
Wash. DC		628645	2000	1	1

The formatting looks off here, but you should be able to copy into Excel. Then PROC import is trivial. Or format it yourself to suit your purpose.

pingsu
Calcite | Level 5

I have to merge 20 files, so I tried to do, I failed it. I don't want to use EXCEL. Usually, I use the R, I call the text itself from raw data, I can edit them such as delete several rows and column, fix the space etc. I want to find that function in SAS to edit dataset.

ballardw
Super User

If you go through one of the import wizards you will note that you have an option to indicate which line the data actually starts on. The option DATAROW can say to start reading the data on a specific row. Add

 

datarow=8;

 

The issue you may have is that we don't have a similar option to tell which row to read the variable names. Getnames assumes the variable names are on row 1. So you may gett messages about invalid variable names and the first row of your title will likely get truncated to 32 characters and used as a varaible name.

 

Sometimes the best thing to do use is proc import to generate code you can modify. The log will contain a data step that attempts to read the data using the information you provided to Proc Import. Copy it into the editor and modify to your hearts content. You may do things like search and replace to change the variable names, modify informats and formats, add labels and even the first step of any data modifications.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 1044 views
  • 2 likes
  • 3 in conversation