I have a dataset of rental prices for different properties. It looks like this:

data = {
    'prices': [
        '$350.00',
        '$450.00 pw',
        '$325 per week',
        '$495pw - Views! White goods!',
        '$460p/w + gst and outgoings',
        '$300 wk',
        '$390pw / $1695pcm',
        '$180 pw / $782 pm',
        '$375 Per Week/Fully Furnished',
        '$350 pw + GST & Outgoings',
        'APPLY NOW - From $270 per week',
        '$185 per night',
        '$400pw incl. power',
        '$500 weekly',
        '$600 per week pw',
        '$850 per week (Fully furnished)',
        'FROM $400PW, FURNITURE AND BILLS INCLUDED',
        'THE DEAL- $780 PER WEEK',
        'THE DEAL: $1,400 PER WEEK',
        '$750/W Unfurnished',
        '$320 - fully furnished pw',
        '$330 PER WEEK | $1,430 P.C.M',
        'Enquire Now: $690 per week',
        '$460 per week / $1999 per month',
        '$490 per week/Under Application approved',
        '$1550pw - Location! Rare gem!',
        '295 per week',  # Example without a dollar sign
        'unit 2 - $780pw unit 3 - $760pw',  # Example with multiple prices
        '$2500 pw high, $1600pw low,$380 pn',  # Example with multiple prices
        'from $786 - $1572 per week',  # Example with a range
        '$590 to $639',  # Example with a range
        '$280 - $290 pw'  # Example with a range
    ]
}

My goal is to clean this 'prices' column so that it shows only the weekly rental price.

I could not handle the last five kinds of entries. Here is what I have done so far:

import re

import pandas as pd

df = pd.DataFrame(data)

def extract_weekly_price(text):
    price_match = re.search(r'\$?([\d,]+)', text)
    if price_match:
        price_str = price_match.group(1)
        price = int(price_str.replace(',', ''))
        
        # convert to weekly if not
        if re.search(r'(per week|p\.w\.|p/w|pw|/w|weekly)', text):
            return price  
        elif 'p.a' in text:
            return price / 52  
        elif re.search(r'(p\.c\.m|pcm|mth|pm)', text):
            return price / 4.33 
        elif 'per night' in text:
            return price * 7
        else:
            return price  
    else:
        return None

df['prices'] = df['prices'].str.lower()
df['Weekly_rent'] = df['prices'].apply(extract_weekly_price).round(3)

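How can I modify my code so that I get the average weekly price for entries that contain a range, such as '$590 to $639' or '$280 - $290 pw'? Any help would be much appreciated.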

Recommended answer

You can try this approach and check whether it computes the average prices correctly:

import numpy as np
import pandas as pd

pat = (
    r"(?i)"
    r"(\d+[.,]?\d+)\s*" # price (two or more digits, optional separator)
    r"(?:-.*\bfurnished\s*)?" # optional text such as "- fully furnished"
    r"(p/?w|wk|per week|weekly|/w|" # weekly
    r"p\.?c\.?m|mth|pm|per month|" # monthly
    r"per night|pn|" # nightly
    r"p\.a)?" # yearly
)

# one row per price found, indexed by (original row, match number)
tmp = df["prices"].str.extractall(pat)
fn = lambda x: tmp[1].str.contains(x, case=False, na=False)
s0 = tmp[0].replace(",", "", regex=True).astype(float)

# convert each price to a weekly figure; a price without a unit is kept as-is
# ("night|pn" rather than "n", so that "per month" is not treated as nightly)
weekly = np.select(
    [fn("w"), fn("night|pn"), fn("m"), fn(r"p\.a")],
    [s0, s0.mul(7), s0.div(4.33), s0.div(52)], default=s0
)

out = (
    df[["prices"]].join(
        tmp.assign(all_prices=weekly.round(2)).groupby(level=0)
            .agg(
                computed_prices=("all_prices", list), # optional
                average=("all_prices", "mean")
            )
    )
)

pd.set_option("display.precision", 2) # print floats with two decimals
print(out)

Output:

                                       prices           computed_prices  average
0                                     $350.00                   [350.0]   350.00
1                                  $450.00 pw                   [450.0]   450.00
2                               $325 per week                   [325.0]   325.00
3                $495pw - Views! White goods!                   [495.0]   495.00
4                 $460p/w + gst and outgoings                   [460.0]   460.00
5                                     $300 wk                   [300.0]   300.00
6                           $390pw / $1695pcm           [390.0, 391.45]   390.73
7                           $180 pw / $782 pm            [180.0, 180.6]   180.30
8               $375 Per Week/Fully Furnished                   [375.0]   375.00
9                   $350 pw + GST & Outgoings                   [350.0]   350.00
10             APPLY NOW - From $270 per week                   [270.0]   270.00
11                             $185 per night                  [1295.0]  1295.00
12                         $400pw incl. power                   [400.0]   400.00
13                                $500 weekly                   [500.0]   500.00
14                           $600 per week pw                   [600.0]   600.00
15            $850 per week (Fully furnished)                   [850.0]   850.00
16  FROM $400PW, FURNITURE AND BILLS INCLUDED                   [400.0]   400.00
17                    THE DEAL- $780 PER WEEK                   [780.0]   780.00
18                  THE DEAL: $1,400 PER WEEK                  [1400.0]  1400.00
19                         $750/W Unfurnished                   [750.0]   750.00
20                  $320 - fully furnished pw                   [320.0]   320.00
21               $330 PER WEEK | $1,430 P.C.M           [330.0, 330.25]   330.12
22                 Enquire Now: $690 per week                   [690.0]   690.00
23            $460 per week / $1999 per month           [460.0, 461.66]   460.83
24   $490 per week/Under Application approved                   [490.0]   490.00
25              $1550pw - Location! Rare gem!                  [1550.0]  1550.00
26                               295 per week                   [295.0]   295.00
27            unit 2 - $780pw unit 3 - $760pw            [780.0, 760.0]   770.00
28         $2500 pw high, $1600pw low,$380 pn  [2500.0, 1600.0, 2660.0]  2253.33
29                 from $786 - $1572 per week           [786.0, 1572.0]  1179.00
30                               $590 to $639            [590.0, 639.0]   614.50
31                             $280 - $290 pw            [280.0, 290.0]   285.00
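
If you would rather stay close to the function in the question, the same idea can probably be written as a plain function: collect every price together with its unit using re.findall, convert each one to a weekly figure, then average. This is only a sketch. The names PRICE_RE, to_weekly and average_weekly_price are made up for illustration, and the conversion factors (7, 4.33, 52) are simply the ones from the question.

```python
import re
from statistics import mean

# Same idea as the pattern above: a price, optional "- ... furnished" text,
# then an optional unit. Names here are hypothetical, not from any library.
PRICE_RE = re.compile(
    r"(\d+[.,]?\d+)\s*"
    r"(?:-.*\bfurnished\s*)?"
    r"(p/?w|wk|per week|weekly|/w|p\.?c\.?m|mth|pm|per month|per night|pn|p\.a)?",
    re.IGNORECASE,
)


def to_weekly(amount, unit):
    """Convert one amount to a weekly figure based on its unit string."""
    unit = unit.lower()
    if re.fullmatch(r"per night|pn", unit):
        return amount * 7
    if re.fullmatch(r"p\.?c\.?m|mth|pm|per month", unit):
        return amount / 4.33
    if unit == "p.a":
        return amount / 52
    return amount  # weekly unit, or no unit at all (assumed weekly)


def average_weekly_price(text):
    """Average all prices found in text, each converted to weekly."""
    prices = [
        to_weekly(float(number.replace(",", "")), unit)
        for number, unit in PRICE_RE.findall(text)
    ]
    return round(mean(prices), 2) if prices else None


print(average_weekly_price("$590 to $639"))
print(average_weekly_price("$280 - $290 pw"))
```

It could then be applied to the column with df['Weekly_rent'] = df['prices'].apply(average_weekly_price). Treating a price without a unit as weekly is an assumption that happens to suit this dataset; adjust it if your listings default to something else.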
