
pyspark sql - Behavior of 'Column' object within function spark

Question:

I am writing code to replace characters matching the pattern [^\w | ] with ''. The problem is that when I use the DataFrame 'sentenceDF' inside my function 'removePunctuation', I get the error "'Column' object is not callable".

from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    cleanString = column
    cleanString = cleanString.select(regexp_replace(sentenceDF['sentence'],'[^\w | ]','').alias('sentence'))
    cleanString = cleanString.select(regexp_replace(cleanString['sentence'],'_','').alias('sentence'))
    cleanString = cleanString.select(lower(cleanString['sentence']))
    return cleanString

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
result = sentenceDF.select(removePunctuation(col('sentence')))
result.show()

Traceback:

TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-50-aa978fac8bae> in <module>()
     15 (' * Remove punctuation then spaces * ',)], ['sentence'])
     16
---> 17 result = sentenceDF.select(removePunctuation(col('sentence')))
     18 result.show()

<ipython-input-50-aa978fac8bae> in removePunctuation(column)
      4 def removePunctuation(column):
      5     cleanString = column
----> 6     cleanString = cleanString.select(regexp_replace(sentenceDF['sentence'],'[^\w | ]','').alias('sentence'))
      7     cleanString = cleanString.select(regexp_replace(cleanString['sentence'],'_','').alias('sentence'))
      8     cleanString = cleanString.select(lower(cleanString['sentence']))

TypeError: 'Column' object is not callable


Answer:

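Your function receives col('sentence'), which is a Column, not a DataFrame, and only a DataFrame has a select method. Just do this and you get the same error: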

col('sentence').select()
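
Why "not callable" rather than "no attribute 'select'"? As far as I can tell from the PySpark source, Column defines __getattr__ so that an unknown attribute is treated as a nested-field lookup and comes back as another Column. Calling that Column is what raises the TypeError. Here is a minimal plain-Python sketch of the mechanism. ToyColumn is a hypothetical stand-in I made up for illustration, not the real pyspark.sql.Column:

```python
class ToyColumn:
    """Hypothetical stand-in for pyspark.sql.Column; not the real class."""

    def __init__(self, name):
        self.name = name

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails. Unknown
        # attributes are treated as a nested-field access and returned
        # as another column instead of raising AttributeError.
        if item.startswith("__"):
            raise AttributeError(item)
        return ToyColumn(f"{self.name}.{item}")


sentence = ToyColumn("sentence")

# No AttributeError here: the lookup silently yields another ToyColumn.
field = sentence.select

try:
    # Calling that ToyColumn instance is what fails.
    sentence.select()
except TypeError as exc:
    message = str(exc)

print(field.name)
print(message)
```

So the error message points at the call, while the actual mistake is one step earlier: select was looked up on a column instead of on a DataFrame.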

Suggestion: Always try to write the code out before you refactor to functions.

Anyways, here's what you want, I think.

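def removePunctuation(df, column):
    # Lowercase and trim the named column, keeping its name
    cleanString = df.select(trim(lower(col(column))).alias(column))
    # Remove every non-word character (whitespace included) and underscores
    cleanString = cleanString.select(regexp_replace(column, r'[^\w]|\s+|_', '').alias(column))

    return cleanString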

result = removePunctuation(sentenceDF, 'sentence')
result.show()

+--------------------+
|            sentence|
+--------------------+
|               hiyou|
|        nounderscore|
|removepunctuation...|
+--------------------+
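
Note that this pattern strips the spaces between words too, which is why the rows come out as 'hiyou' and so on. If you want to try patterns quickly without a cluster, a rough check with Python's re module works for these inputs. This is only an approximation: Spark evaluates regexp_replace with Java's regex engine, and the two engines can differ on non-ASCII text. KEEP_SPACES_PATTERN and the clean helper are hypothetical names of mine, shown in case the intent was to keep the inner spaces:

```python
import re

sentences = ['Hi, you!',
             ' No under_score!',
             ' * Remove punctuation then spaces * ']

# The pattern from the answer: non-word characters, whitespace, underscores.
ANSWER_PATTERN = r'[^\w]|\s+|_'

# Hypothetical alternative that keeps spaces: remove anything that is not
# an ASCII letter, a digit, or whitespace (so underscores go as well).
KEEP_SPACES_PATTERN = r'[^a-zA-Z0-9\s]'


def clean(text, pattern):
    """Mimic trim(lower(col)) followed by regexp_replace(col, pattern, '')."""
    return re.sub(pattern, '', text.lower().strip())


no_spaces = [clean(s, ANSWER_PATTERN) for s in sentences]

# Strip once more, because removing a leading '*' can expose a space.
with_spaces = [clean(s, KEEP_SPACES_PATTERN).strip() for s in sentences]

print(no_spaces)
print(with_spaces)
```

In Spark the second variant would correspond to wrapping the regexp_replace call in trim, but I have not run that against a cluster, so treat it as a starting point.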