Learning SAS? Welcome to the exclusive online community for all SAS learners.

How to read Data with multiple delimiter

Accepted Solution Solved
Reply
New Contributor
Posts: 4
Accepted Solution

How to read Data with multiple delimiter

I have a data file as "movies.dat" file with data on movies from the MovieLens Website.

I have converted this .dat file to text file for sake of simplicity and to have a look on data.

After than i took only 10 observation to write a sample code for have a test for reading and printing the data.

Converted text file and abridged text file with 10 entries are in attachment.


It contains delimiters ":" and "|".

The data as stored in abridged text file is as following.

1::Toy Story (1995)::Animation|Children's|Comedy

2::Jumanji (1995)::Adventure|Children's|Fantasy

3::Grumpier Old Men (1995)::Comedy|Romance

4::Waiting to Exhale (1995)::Comedy|Drama

5::Father of the Bride Part II (1995)::Comedy

6::Heat (1995)::Action|Crime|Thriller

7:Smiley Frustratedabrina (1995)::Comedy|Romance

8::Tom and Huck (1995)::Adventure|Children's

9:Smiley Frustratedudden Death (1995)::Action

10::GoldenEye (1995)::Action|Adventure|Thriller

I want to read the data as following

MovieIDTitleGenres1Genres2Genres3
1Toy Story(1995)AnimationChildren'sComedy
2Jumanji(1995)AdventureChildren'sFantasy
3Grumpier Old Men(1995)ComedyRomance
4Waiting to Exhale(1995)ComedyDrama
5Father of the Bride Part II(1995)Comedy
6Heat(1995)ActionCrimeThriller
7Sabrina(1995)ComedyRomance
8Tom and Huck(1995)AdventureChildren's
9Sudden Death(1995)Action
10GoldenEye(1995)ActionAdventureThriller

I wrote a code in SAS as following with the "Movies Data2.txt" text file saved on my desktop.

data TEST;

infile "C:\Users\gautam\Desktop\Movies Data2.txt" delimiter = 'Smiley Indifferent' MISSOVER DSD;

Input MoviesId Title $ Genre1-Genre5 $;

Run;

Proc Print Data =  TEST;

Run;

However the data is not being read and missing values are assigned.

Obs

MoviesId

Title

Genre1

Genre2

Genre3

Genre4

Genre5

1

1

.

.

.

.

Comedy

2

2

.

.

.

.

Fantasy

3

3

.

.

.

.

4

4

.

.

.

.

5

5

.

.

.

.

6

6

.

.

.

.

Thriller

7

7

.

.

.

.

8

8

.

.

.

.

9

9

.

.

.

.

10

10

.

.

.

.

Thriller

Kindly help me to write a code to read this data.

I am taking these assumptions.

1. Data is available only in text format

2. Only SAS environment is available for reading and analyzing the data.

3. As the original data is very large hence, practically i dont know how many characters the deTitle of movie in data sets acquires. It may range anywhere greater than one character.

Please guide.

Attachment
Attachment

Accepted Solutions
Solution
‎07-21-2015 07:39 PM
Super User
Posts: 17,784

Re: How to read Data with multiple delimiter

[ Edited ]

@Patrick You're correct and Thanks. It worked with the test data, but if a movie had a colon in the name this would be an issue.

 

data test;
informat MoviesID 8. Title $100. Genre $100.;
infile "C:\temp\movies Data.txt" dlmstr = '::' truncover ;
Input MoviesId Title $ Genre $;
Run;
 
 
data test2;
set test;
array gen(5) $50.;
 
 
ngen=countw(genre, "|");
do i=1 to ngen;
gen(i)=scan(genre, i, "|");
end;
 
 
drop i ngen;
run;

 

View solution in original post


All Replies
Respected Advisor
Posts: 3,887

Re: How to read Data with multiple delimiter

What about using "dlmstr" SAS(R) 9.4 Statements: Reference, Fourth Edition

data TEST(drop=_Smiley Happy;

  infile datalines dlmstr='::'  truncover DSD;

  Input MoviesId $ Title :$30. _Genre_Str :$100.;

  array genre {5} $20.;

  do _i=1 to dim(genre);

    genre[_i]=scan(_genre_str,_i,'|');

    if missing(genre[_i]) then leave;

  end;

  datalines;

1::Toy Story (1995)::Animation|Children's|Comedy

2::Jumanji (1995)::Adventure|Children's|Fantasy

3::Grumpier Old Men (1995)::Comedy|Romance

4::Waiting to Exhale (1995)::Comedy|Drama

5::Father of the Bride Part II (1995)::Comedy

6::Heat (1995)::Action|Crime|Thriller

7:Smiley Frustratedabrina (1995)::Comedy|Romance

8::Tom and Huck (1995)::Adventure|Children's

9:Smiley Frustratedudden Death (1995)::Action

10::GoldenEye (1995)::Action|Adventure|Thriller

;

run;

New Contributor
Posts: 4

Re: How to read Data with multiple delimiter

Hello Patrick,

Thanks for replying.

Code has done pretty well. It also has differentiated the semicolons present in the names of movie in the master data set. For example for the name of "Crow: City of Angels, The (1996)". in master dataset it has read it perfectly.

I just increased the number of characters in Title to read to 100 in the informat statement. < Input MoviesId $ Title :$100. _Genre_Str :$100.; >

All works like charm.

Super User
Posts: 17,784

Re: How to read Data with multiple delimiter

Here's a bad way most likely but definitley works.

Your delimiters are not interchangeable - a good thing. Use the : to separate out the variables and a second pass to separate out the genres. This assumes you have 5 genres, if you have more than 5 increase your array.

data TEST;

informat MoviesID 8. Title $100. Genre $100.;

infile "C:\temp\movies Data.txt" delimiter = '::' truncover ;

Input MoviesId Title $ Genre $;

Run;

data test2;

set test;

array gen(5) $50.;

ngen=countw(genre, "|");

do i=1 to ngen;

gen(i)=scan(genre, i, "|");

end;

drop i ngen;

run;

Respected Advisor
Posts: 3,887

Re: How to read Data with multiple delimiter

I believe you need to use "dlmstr" instead of "delimiter" as else a single ":" will also be treated as delimiter.

Solution
‎07-21-2015 07:39 PM
Super User
Posts: 17,784

Re: How to read Data with multiple delimiter

[ Edited ]

@Patrick You're correct and Thanks. It worked with the test data, but if a movie had a colon in the name this would be an issue.

 

data test;
informat MoviesID 8. Title $100. Genre $100.;
infile "C:\temp\movies Data.txt" dlmstr = '::' truncover ;
Input MoviesId Title $ Genre $;
Run;
 
 
data test2;
set test;
array gen(5) $50.;
 
 
ngen=countw(genre, "|");
do i=1 to ngen;
gen(i)=scan(genre, i, "|");
end;
 
 
drop i ngen;
run;

 

New Contributor
Posts: 4

Re: How to read Data with multiple delimiter

Works perfectly even with master data set on movies.

New Contributor
Posts: 4

Re: How to read Data with multiple delimiter

Just there is a little error in pasting in your code for Initial Data keyword.

Data test; If you can edit it for future use would be apt.

Super User
Posts: 10,483

Re: How to read Data with multiple delimiter

I think you'd be better off reading the genre as one variable and then parsing into specific variables. I would create a number of 0/1 for No/yes variables to indicate Action, Adventure, Comedy, Drama etc in preference to a genre1 that would end up with all of the values and need to be manipulated later.

data TEST;

infile "C:\Users\gautam\Desktop\Movies Data2.txt" dlmstr = '::' MISSOVER DSD;
informat MoviesId best8. dummy  $2. title dummy2 $50.   genre dummy3 $50.;
informat MoviesId best8.  title  $50.   genre $50.;
Input MoviesId  Title $  Genre ;

Comedy = (index(genre,'Comedy') >0); /* as one example to create the 0/1 variables*/
Run;

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 8 replies
  • 559 views
  • 9 likes
  • 4 in conversation