Python: how to extract only the DataFrame rows where the values of two columns are both in English

Posted on June 7

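I have a DataFrame with 27 columns, including the columns FonctionsStagiaire and ExigencesParticulieres. The DataFrame has 13774 rows, all written in either English or French. The CSV file can be found here: GDrive link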

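I am trying to keep only the rows of the DataFrame where the value or content of FonctionsStagiaire or ExigencesParticulieres is in English. I want to drop the entire row whenever these columns contain French values.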

I am using langdetect, but I get the error langdetect.lang_detect_exception.LangDetectException: No features in text. I have tried every solution I could find on SO for this error, but nothing worked.
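One common way to keep a single undetectable cell from aborting the whole apply() is to wrap the detector. The sketch below is only an illustration: safe_detect is a hypothetical helper name, the detector parameter is an assumption added so the function can be tried with a stand-in, and the broad except is a simplification (langdetect itself raises LangDetectException).

```python
def safe_detect(text, detector=None):
    """Return a language code, or None when the text cannot be classified."""
    if detector is None:
        # Imported lazily so the sketch can also be run with a stand-in detector.
        from langdetect import detect as detector

    # NaN, None, numbers and blank strings carry no features to detect.
    if not isinstance(text, str) or not text.strip():
        return None

    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException ("No features in text.")
        # for inputs such as "12345" or "!!!"; treat those as unknown.
        return None
```

With a helper like this, something along the lines of df[col].apply(safe_detect).eq('en') should give False for undetectable cells instead of raising.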

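I also want to check whether these two columns contain only NaN values, only digits or whitespace characters, or only punctuation marks, etc. Such values cannot be detected as either English or French, and they may cause langdetect to raise an error.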
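The checks described above can be sketched with plain string methods, one cell at a time. The function name classify_cell and its category labels are made up for this illustration; it is a rough pre-filter, not a language detector.

```python
import string


def classify_cell(value):
    """Label a cell so that undetectable values can be filtered out first."""
    if not isinstance(value, str):
        return "missing"            # NaN, None, pd.NA or a bare number

    compact = "".join(value.split())  # drop every whitespace character
    if not compact:
        return "blank"              # empty or whitespace only
    if compact.isnumeric():
        return "numeric"            # digits only
    if all(ch in string.punctuation for ch in compact):
        return "punctuation"        # ASCII punctuation only
    if not any(ch.isalpha() for ch in compact):
        return "no_letters"         # mixed digits/symbols, still no letters
    return "text"                   # has letters, worth passing to a detector
```

Applied per column, for example with df[col].map(classify_cell).eq("text"), this would give a boolean mask of the cells that are worth sending to the detector.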

When I try to see which row throws the error, based on this SO question, it shows This row throws error for every row.
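When every row is reported as failing, it can help to record the exception type and message next to the offending value rather than printing a fixed string. A minimal sketch, assuming the detector is passed in as a callable; rows_that_fail is a hypothetical name.

```python
def rows_that_fail(pairs, detector):
    """Collect (index, value, error) for every value the detector rejects.

    `pairs` is any iterable of (index, value) tuples, for example
    enumerate(some_list) or a pandas Series' .items().
    """
    failures = []
    for idx, value in pairs:
        try:
            detector(value)
        except Exception as exc:
            # Keep the exception type and message so the cause stays visible.
            failures.append((idx, repr(value), f"{type(exc).__name__}: {exc}"))
    return failures
```

Calling rows_that_fail(df['FonctionsStagiaire'].items(), detect) would then show, per row, whether the problem is a missing value, a non-string, or genuinely featureless text.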

I intend to use langdetect, but any other solution would also be helpful, as long as it works.

Any help is much appreciated.

I am trying to use the following code:

from langdetect import detect
import pandas as pd
import re
import string

def filter_nonenglish(df):
     
    list = ['FonctionsStagiaire', 'ExigencesParticulieres']

    for col in list:       
        #to check if values are only whitespaces or only numeric characters or only punctuation marks
        #something like this -> if (df[col].apply(lambda x: x.isnumeric()) == True) | (df[col].apply(lambda x: x.isspace()) == True) | (all(i in string.punctuation for i in df[col]) == True):
            return False     
        else:
            #trying to use apply() to apply detect() to each <str> values of the columns and not to the entire pd.Series object
            new_df = df[(df[col].apply(detect).eq('en'))]

    return new_df
   
    #df['FonctionsStagiaire'] = df['FonctionsStagiaire'].apply(detect)
    #df['ExigencesParticulieres'] = df['ExigencesParticulieres'].apply(detect)
    
    #df = df[df['FonctionsStagiaire'] == 'en']
    #df = df[df['ExigencesParticulieres'] == 'en']
    
    #new_df = df[(df.FonctionsStagiaire.apply(detect).eq('en')) & (df.ExigencesParticulieres.apply(detect).eq('en'))]
    
df = pd.read_csv('emplois_df_parsed.csv')

df = df[df['FonctionsStagiaire'].notna() & df['ExigencesParticulieres'].notna()]   #to remove empty values

#to check whether all empty values are removed or not in the column 'FonctionsStagiaire'
#df['FonctionsStagiaire'] = df['FonctionsStagiaire'].str.lower()  
#df = df[df['FonctionsStagiaire'].str.islower()]

#to check whether all empty values are removed or not in the column 'ExigencesParticulieres'
#df['ExigencesParticulieres'] = df['ExigencesParticulieres'].str.lower()
#df = df[df['ExigencesParticulieres'].str.islower()]

#to make sure that values of both the columns are of <str> datatype
df['FonctionsStagiaire'] = pd.Series(df['FonctionsStagiaire'], dtype = "string")
df['ExigencesParticulieres'] = pd.Series(df['ExigencesParticulieres'], dtype = "string")

#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'FonctionsStagiaire']))
#bool(re.match('^(?=.*[a-zA-Z])', df.loc[:, 'ExigencesParticulieres']))

#   df[df[column].map(lambda x: x.isascii())]

df_new = filter_nonenglish(df)

df_new.to_csv('emplois_df_filtered.csv', index= False)

Recommended answer

tmp = df.loc[[24], ["FonctionsStagiaire", "ExigencesParticulieres"]].T

                                                                       24
FonctionsStagiaire      https://www2.csrdn.qc.ca//files/jobs/P-22-875-...
ExigencesParticulieres  https://www2.csrdn.qc.ca//files/jobs/P-22-875-...

#pip install langid
from langid import classify

use_cols = ["FonctionsStagiaire", "ExigencesParticulieres"]

# checking english content
is_en = [
    classify(r)[0] == "en"
    for r in df[use_cols].fillna("").agg(" ".join, axis=1)
]

# not a null content
not_na = df[use_cols].notna().all(axis=1)

# not a random content (optional!); True when a cell is only a short token, negated below
not_arc = df[use_cols].apply(lambda x: x.str.fullmatch(r"\w+\s?\d?")).any(axis=1)

out = df.loc[is_en & not_na & ~not_arc]

print(out.loc[:, use_cols])

                          FonctionsStagiaire                ExigencesParticulieres
11     As a Level I Technical Customer Serv...         Qualifications Your contri...
106       What you'll create and do Today's...   What you'll bring to this role: A...
140    The Training Department plays a cruc...             REQUIREMENTS: \t Coll...
...                                        ...                                 ...
13608  CCHS Facility: Cleveland Clinic Cana...  MINIMUM QUALIFICATIONS: · Registere...
13697                              Calculate!                      Love numbers...
13698                              Calculate!                      Love numbers...

[311 rows x 2 columns]
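To see what the optional "random content" filter in the answer would match, the same pattern can be tried on single strings with the standard re module. This is only an illustration; looks_like_token is a made-up helper name and the sample strings are invented.

```python
import re

# Same pattern as in the answer: word characters, then an optional
# whitespace character, then an optional digit, and nothing else.
PATTERN = re.compile(r"\w+\s?\d?")


def looks_like_token(text):
    """True when the whole string is a single short token such as 'P22'."""
    return PATTERN.fullmatch(text) is not None


for sample in ["P22", "abc 1", "Calculate!", "Registered nurse with 5 years"]:
    print(repr(sample), looks_like_token(sample))
```

Full sentences and anything containing punctuation fail the full match, so only cells made of a single bare token are flagged by this filter.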
