
regex - Read names of variable complexity without delimiters, e.g. baseball players

Question:

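I have an external data file like the one below. The fields are separated only by spaces, and the names themselves contain spaces, so there is no usable delimiter: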

PLAYER TEAM STUFF1 STUFF2

Jim Smith NYY 100 200

Jerry Johnson Jr. PHI 100 200

Andrew C. James STL 200 200

A. J. Williams CWS 100 200

Felix Rodriguez BAL 100 100

How can I read this file? I am thinking of using readLines and splitting the string before any sequence of three consecutive capital letters. However, I do not know how to do it.

What if only the first letter of the team name was capitalized?

Below is a similar file in which a name is followed by a column of numbers. I can read these data with the code that follows:

 TEAM STUFF1 STUFF2

New York Yankees 100 200

Philadelphia Phillies 100 200

Boston Red Sox 200 200

Los Angeles Angels 100 200

Chicago White Sox 100 100

Chicago Cubs 200 100

New York Mets 200 200

San Francisco Giants 100 300

Minnesota Twins 100 300

St. Louis Cardinals 200 300

Here is the code to read the second data set:

setwd('c:/users/mmiller21/simple R programs/')

my.data3 <- readLines('team.names.with.spaces.txt')

# split between desired columns

my.data4 <- do.call(rbind, strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = TRUE))

# returns string w/o leading or trailing whitespace

# This function is not mine and was found on Stack Overflow

trim <- function (x) gsub("^\\s+|\\s+$", "", x)

my.data5 <- trim(my.data4)

# remove header

my.data6 <- my.data5[-1,]

# convert to data.frame

my.data6 <- data.frame(my.data6, stringsAsFactors = FALSE)

my.data6[,2] <- as.numeric(my.data6[,2])

my.data6[,3] <- as.numeric(my.data6[,3])

my.data6

                      X1  X2  X3
1       New York Yankees 100 200
2  Philadelphia Phillies 100 200
3         Boston Red Sox 200 200
4     Los Angeles Angels 100 200
5      Chicago White Sox 100 100
6           Chicago Cubs 200 100
7          New York Mets 200 200
8   San Francisco Giants 100 300
9        Minnesota Twins 100 300
10   St. Louis Cardinals 200 300

Thank you for any advice. I prefer a solution in base R.

Answer:

Here's a simple solution that satisfies your requirements. It tokenizes each line on whitespace and rebuilds the name from the leading tokens, so it assumes the name is the only field that contains multiple tokens. Note that the original spacing inside a name is not preserved (runs of spaces collapse to a single space), and the code may not work correctly if the file contains embedded tabs instead of spaces:

library(stringr)
# read the player file, one record per line
lines <- readLines("player.names.with.spaces.txt")
for (line in lines[-1]) {    # skip the header row
    # split on runs of spaces after trimming the ends
    toks <- strsplit(str_trim(line), " +")[[1]]
    ntoks <- length(toks)
    # the last three tokens are team, num1, num2; everything before is the name
    name <- paste(toks[1:(ntoks-3)], collapse = " ")
    team <- toks[ntoks-2]
    num1 <- as.integer(toks[ntoks-1])
    num2 <- as.integer(toks[ntoks])
    print(line)
    print(name)
    print(team)
    print(num1)
    print(num2)
}

I recommend keeping str_trim() unless your files are always cleanly formatted, in which case you may be able to drop the stringr dependency. The output looks like this:

[1] "Jim Smith NYY    100  200"
[1] "Jim Smith"
[1] "NYY"
[1] 100
[1] 200
[1] "Jerry Johnson Jr. PHI    100  200"
[1] "Jerry Johnson Jr."
[1] "PHI"
[1] 100
[1] 200
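
If the stringr dependency is unwanted, the same tokenizing idea can be sketched in base R with trimws() (available since R 3.2.0). This is only an illustration under the same assumption that the last three tokens are team and the two numbers. The sample lines are inlined instead of read from a file, and parse_line is a hypothetical helper name, not something from the original answer:

```r
# Sample lines inlined for illustration; in practice they would come from readLines()
lines <- c("PLAYER  TEAM STUFF1 STUFF2",
           "Jim Smith NYY    100  200",
           "Jerry Johnson Jr. PHI    100  200",
           "A. J. Williams   CWS 100  200")

# parse_line is a hypothetical helper: one input line -> one-row data frame
parse_line <- function(line) {
  toks <- strsplit(trimws(line), "[[:space:]]+")[[1]]  # split on runs of whitespace
  n <- length(toks)
  data.frame(name = paste(toks[seq_len(n - 3)], collapse = " "),  # leading tokens
             team = toks[n - 2],
             num1 = as.integer(toks[n - 1]),
             num2 = as.integer(toks[n]),
             stringsAsFactors = FALSE)
}

# Skip the header row and stack the per-line rows
players <- do.call(rbind, lapply(lines[-1], parse_line))
print(players)
```

Splitting on "[[:space:]]+" rather than " +" also tolerates tabs between fields, which addresses the caveat mentioned earlier.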

As an alternative, you might use str_locate() to deal more robustly with multiple spaces or punctuation in the name (for example, a hyphenated name or one containing a comma):

library(stringr)
x="Jerry Johnson Jr. PHI    100  200"
ndx = str_locate(x," +[A-Z]{3} +[0-9]+ +[0-9]+")[1]
name = substr(x,1,ndx-1);
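
Since the question asks for base R, the same "locate the trailing block" idea can probably be expressed with regexpr() instead of str_locate(). The following is a minimal sketch, assuming the team code is exactly three capital letters; the `$` anchor and the `rest` variable are additions for illustration:

```r
x <- "Jerry Johnson Jr. PHI    100  200"

# Start position of the trailing " TEAM NUM NUM" block (-1 if the line does not match)
ndx <- regexpr(" +[A-Z]{3} +[0-9]+ +[0-9]+$", x, perl = TRUE)

# Everything before that block is the name
name <- substr(x, 1, ndx - 1)

# The block itself splits cleanly on runs of spaces
rest <- strsplit(trimws(substring(x, ndx)), " +")[[1]]

print(name)
print(rest)
```

A real script would want to check for `ndx == -1` before calling substr(), because a line that does not fit the pattern would otherwise produce an empty name.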

Answer:

This will split strings before three consecutive capital letters:

setwd('c:/users/mmiller21/simple R programs/')

my.data3 <- readLines('player.names.with.spaces.txt')

strsplit(my.data3, split = "(?<=[ ])(?=[A-Z]{3})", perl = TRUE)

I can probably get the rest from there, although I remain interested in how to read the file if only the first letter of the team name is capitalized.

Here is the result of the above code:

[[1]]
[1] "PLAYER  " "TEAM "    "STUFF1 "  "STUFF2"  

[[2]]
[1] "Jim Smith "      "NYY    100  200"

[[3]]
[1] "Jerry Johnson Jr. " "PHI    100  200"   

[[4]]
[1] "Andrew C. James  " "STL  200  200"    

[[5]]
[1] "A. J. Williams   " "CWS 100  200"     

[[6]]
[1] "Felix Rodriguez   " "BAL 100  100"      
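
One possible way to get from that list to a data frame is sketched below. It is a base R illustration, not part of the original post: the lines are inlined instead of read from the file, and the column names (name, team, stuff1, stuff2) are chosen for the example.

```r
# Sample lines inlined for illustration (same content as the player file)
my.data3 <- c("PLAYER  TEAM STUFF1 STUFF2",
              "Jim Smith NYY    100  200",
              "Jerry Johnson Jr. PHI    100  200",
              "Andrew C. James  STL  200  200",
              "A. J. Williams   CWS 100  200",
              "Felix Rodriguez   BAL 100  100")

# Drop the header, then split each line before the three-capital team code
parts <- strsplit(my.data3[-1], split = "(?<=[ ])(?=[A-Z]{3})", perl = TRUE)

# First piece is the name; second piece is "TEAM  NUM  NUM"
name <- trimws(vapply(parts, `[`, character(1), 1))
rest <- strsplit(trimws(vapply(parts, `[`, character(1), 2)), " +")
rest <- do.call(rbind, rest)   # one row per player, three columns

players <- data.frame(name   = name,
                      team   = rest[, 1],
                      stuff1 = as.numeric(rest[, 2]),
                      stuff2 = as.numeric(rest[, 3]),
                      stringsAsFactors = FALSE)
print(players)
```

This relies on every data line splitting into exactly two pieces, which holds only while no player name contains a word starting with three capital letters.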

Here is a solution if some team names contain three capital letters and others contain two capital letters, as with the following data set:

PLAYER  TEAM STUFF1 STUFF2
Jim Smith NYY    100  200
Jerry Johnson Jr. TB    100  200
Andrew C. James  STL 200  200
A. J. Williams   TB 100  200
Felix Rodriguez   CWS 100  100

my.data3 <- readLines('player.names.with.spaces3.txt')

strsplit(my.data3, split = "(?<=[ ])((?=[A-Z]{2})|(?=[A-Z]{3}))", perl = TRUE)
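
Because any run of three capital letters also begins with two, the second lookahead alternative in the pattern above looks redundant. The sketch below compares it with a shorter pattern on a few inlined sample lines; the variable names are made up for the comparison:

```r
# A few lines from the data set above, inlined for illustration
x <- c("Jim Smith NYY    100  200",
       "Jerry Johnson Jr. TB    100  200",
       "A. J. Williams   TB 100  200")

original   <- "(?<=[ ])((?=[A-Z]{2})|(?=[A-Z]{3}))"
simplified <- "(?<=[ ])(?=[A-Z]{2})"   # two capitals are enough to trigger the split

split_orig <- strsplit(x, split = original,   perl = TRUE)
split_simp <- strsplit(x, split = simplified, perl = TRUE)

# Check whether both patterns split these lines the same way
print(identical(split_orig, split_simp))
print(split_simp)
```

Either pattern would also split inside a name that contains a word starting with two capitals (say, a player listed as "AJ Williams"), so this approach depends on names not looking like team codes.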

Team names might not be all in capital letters, as with this data set:

PLAYER  TEAM STUFF1 STUFF2
Jim Smith NYY    100  200
Jerry Johnson Jr. Phi    100  200
Andrew C. James  StL  200  200
A. J. Williams   CWS 100  200
Felix Rodriguez   Bal 100  100

The following code seems to work by using multiple splits:

setwd('c:/users/mmiller21/simple R programs/')

my.data3 <- readLines('player.names.with.spaces2.txt')

my.data4 <- strsplit(my.data3, split = "(?<=[ ])(?=[0-9])", perl = TRUE)

my.data5 <- do.call(rbind, my.data4)
my.data5 <- my.data5[-1,]

# returns string w/o leading or trailing whitespace

trim <- function (x) gsub("^\\s+|\\s+$", "", x)

my.data6 <- trim(my.data5)

my.data7 <- strsplit(my.data6[,1], ' (?=[^ ]+$)', perl=TRUE)

my.data8 <- do.call(rbind, my.data7)

my.data9 <- trim(my.data8)

my.data10 <- cbind(my.data9, my.data6[,2:3])
my.data10

Here is the result:

     [,1]                [,2]  [,3]  [,4] 
[1,] "Jim Smith"         "NYY" "100" "200"
[2,] "Jerry Johnson Jr." "Phi" "100" "200"
[3,] "Andrew C. James"   "StL" "200" "200"
[4,] "A. J. Williams"    "CWS" "100" "200"
[5,] "Felix Rodriguez"   "Bal" "100" "100"
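
The multiple splits can probably be collapsed into a single regular expression with capture groups, which does not care how the team name is capitalized. This is a base R sketch under the assumption that every data line ends in one space-free team token followed by two integers. The lines are inlined for illustration and the column names are invented for the example:

```r
# Sample lines inlined for illustration (mixed-case team names)
lines <- c("PLAYER  TEAM STUFF1 STUFF2",
           "Jim Smith NYY    100  200",
           "Jerry Johnson Jr. Phi    100  200",
           "Andrew C. James  StL  200  200",
           "A. J. Williams   CWS 100  200",
           "Felix Rodriguez   Bal 100  100")

# name (anything ending in a non-space), team (no spaces), then two integers
pattern <- "^(.*[^ ]) +([^ ]+) +([0-9]+) +([0-9]+) *$"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(lines[-1], regexec(pattern, lines[-1]))
fields <- do.call(rbind, m)[, -1, drop = FALSE]   # drop the full-match column

players <- data.frame(name   = fields[, 1],
                      team   = fields[, 2],
                      stuff1 = as.numeric(fields[, 3]),
                      stuff2 = as.numeric(fields[, 4]),
                      stringsAsFactors = FALSE)
print(players)
```

A line that does not match the pattern yields an empty match, so the rbind() step would misalign or fail. Checking `lengths(m)` first would be a sensible guard for messier files.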