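I have a dataframe with 27 columns, including the columns FonctionsStagiaire and ExigencesParticulieres. The dataframe has 13774 rows, and all of the text is in either English or French. The CSV file can be found here: GDrive link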
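I am trying to keep only the rows of the dataframe where the value of FonctionsStagiaire or ExigencesParticulieres is in English. I want to drop the entire row whenever these columns contain French text.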
I am using langdetect, but I get the error langdetect.lang_detect_exception.LangDetectException: No features in text. I have tried all the solutions I could find on SO for this error, but none of them worked.
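I also want to check whether values in these two columns are only NaN, only digits, only whitespace, or only punctuation marks. Such values cannot be identified as either English or French, and they may be what makes langdetect raise the error.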
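To make the check I have in mind concrete, here is a minimal sketch of a per-value test. The helper name has_detectable_text is hypothetical, and treating "contains at least one letter" as the condition for being detectable is my assumption, not something langdetect documents.

```python
def has_detectable_text(value):
    """Return True only if value is a string containing at least one letter.

    Assumption: a value without any letter (NaN, digits, whitespace,
    punctuation) gives a language detector nothing to work with.
    """
    # pandas reads missing cells as float NaN, so non-strings are rejected here.
    if not isinstance(value, str):
        return False
    # str.isalpha() is also True for accented letters such as 'é'.
    return any(ch.isalpha() for ch in value)
```

Applied to a column, something like df[col].apply(has_detectable_text) would give a boolean mask that could be used to drop such rows before detect() is ever called.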
When I tried to find out which row throws the error, based on this SO question, it printed This row throws error for every single row.
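A per-row check of this kind can be sketched as below. Unlike a bare message, it records the exception type and text, which would show whether every row really fails with the same LangDetectException or with something else entirely. The names find_failing_rows and detect_fn are hypothetical; detect_fn stands in for langdetect's detect.

```python
def find_failing_rows(values, detect_fn):
    """Return (position, exception type name, message) for each value
    on which detect_fn raises."""
    failures = []
    for position, value in enumerate(values):
        try:
            detect_fn(value)
        except Exception as exc:
            # Broad on purpose: the goal is to see what is actually raised.
            failures.append((position, type(exc).__name__, str(exc)))
    return failures
```

It could be called as find_failing_rows(df['FonctionsStagiaire'], detect). The reported positions are positional offsets in the column, not index labels.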
I would prefer to use langdetect, but any other solution would also be helpful as long as it works.
Any help is much appreciated.
This is the code I am trying to use:
from langdetect import detect
import pandas as pd
import re
import string
def filter_nonenglish(df):
    cols = ['FonctionsStagiaire', 'ExigencesParticulieres']
    new_df = df
    for col in cols:
        # Not implemented yet: skip values that are only whitespace, only numeric characters, or only punctuation marks,
        # something like this -> df[col].apply(lambda x: x.isnumeric() or x.isspace() or all(i in string.punctuation for i in x))
        # Use apply() so that detect() runs on each <str> value of the column and not on the entire pd.Series object
        new_df = new_df[new_df[col].apply(detect).eq('en')]
    return new_df
#df['FonctionsStagiaire'] = df['FonctionsStagiaire'].apply(detect)
#df['ExigencesParticulieres'] = df['ExigencesParticulieres'].apply(detect)
#df = df[df['FonctionsStagiaire'] == 'en']
#df = df[df['ExigencesParticulieres'] == 'en']
#new_df = df[(df.FonctionsStagiaire.apply(detect).eq('en')) & (df.ExigencesParticulieres.apply(detect).eq('en'))]
df = pd.read_csv('emplois_df_parsed.csv')
df = df[df['FonctionsStagiaire'].notna() & df['ExigencesParticulieres'].notna()] #to remove empty values
#to check whether all empty values are removed or not in the column 'FonctionsStagiaire'
#df['FonctionsStagiaire'] = df['FonctionsStagiaire'].str.lower()
#df = df[df['FonctionsStagiaire'].str.islower()]
#to check whether all empty values are removed or not in the column 'ExigencesParticulieres'
#df['ExigencesParticulieres'] = df['ExigencesParticulieres'].str.lower()
#df = df[df['ExigencesParticulieres'].str.islower()]
#to make sure that values of both the columns are of <str> datatype
df['FonctionsStagiaire'] = pd.Series(df['FonctionsStagiaire'], dtype = "string")
df['ExigencesParticulieres'] = pd.Series(df['ExigencesParticulieres'], dtype = "string")
#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'FonctionsStagiaire']))
#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'ExigencesParticulieres']))
# df[df[column].map(lambda x: x.isascii())]
df_new = filter_nonenglish(df)
df_new.to_csv('emplois_df_filtered.csv', index= False)