Home  >  Article  >  Backend Development  >  How to create columns based on other DataFrame row filters?

How to create columns based on other DataFrame row filters?

WBOY
WBOYforward
2024-02-09 13:30:04537browse

如何基于其他 DataFrame 行过滤器创建列?

Question content

I have a lazyframe called "hourly_data" which contains an hourly datetime column called "time". I also have a dataframe called "future_periods" which contains two datetime columns called "start" (start datetime of future period) and "end" (end time of future period). Importantly, these future periods do not overlap.

I want to create a column called "period" for hourly_data lazyframe, it should have an int value based on which period (future_periods dataframe row, from 0 to 9 if there are 10 periods) time column value The value of hourly_data is between the start and end column values ​​of future_periods.

I tried doing the following:

periods = pl.series(range(future_periods.height))
hourly_data = hourly_data.with_columns(
    (
        pl.when(((future_periods.get_column('start') <= pl.col('time')) & (pl.col('time') <= future_periods.get_column('end'))).any())
        .then(periods.filter(pl.series((future_periods.get_column('start') <= pl.col('real_time')) & (pl.col('real_time') <= future_periods.get_column('end')))).to_list()[0])
        .otherwise(none)
    ).alias('period')
)

But this gives me the error: typeerror: Calling series constructor with unsupported type 'expr' for values argument

What I want to achieve: enter:

hourly_data:
┌────────────────────┐
│ time               │
│ ---                │
│ datetime           │
╞════════════════════╡
│ 2024-01-01 00:00:00│
│ 2024-01-01 01:00:00│
│ 2024-01-01 02:00:00│
│         ...        │
│ 2024-03-31 23:00:00│
│ 2024-04-01 00:00:00│
│ 2024-04-01 01:00:00│
│         ...        │
│ 2024-06-01 00:00:00│
└────────────────────┘
future_periods:
┌─────────────────────────┬───────────────────────┐
│ start                   ┆ end                   │
│ ---                     ┆ ---                   │
│ datetime                ┆ datetime              │
╞═════════════════════════╪═══════════════════════╡
│ 2024-01-01 00:00:00     ┆ 2024-01-31 23:00:00   │
│ 2024-02-01 00:00:00     ┆ 2024-02-28 23:00:00   │
│ 2024-03-01 00:00:00     ┆ 2024-03-31 23:00:00   │
│ 2024-04-01 00:00:00     ┆ 2024-05-31 23:00:00   │
└─────────────────────────┴───────────────────────┘

Output:

hourly_data:
┌─────────────────────────┬────────┐
│ time                    ┆ period │
│ ---                     ┆ ---    │
│ datetime                ┆ int    │
╞═════════════════════════╪════════╡
│ 2024-01-01 00:00:00     ┆ 0      │
│ 2024-01-01 01:00:00     ┆ 0      │
│ 2024-01-01 02:00:00     ┆ 0      │
│          ...            ┆ ...    │
│ 2024-03-31 23:00:00     ┆ 2      │
│ 2024-04-01 00:00:00     ┆ 3      │
│ 2024-04-01 01:00:00     ┆ 3      │
│          ...            ┆ ...    │
│ 2024-06-01 00:00:00     ┆ None   │
└─────────────────────────┴────────┘

Correct answer


Generally speaking, it is an inequality join, or in your case, a range join. Here's one way to do it. Let's first create some sample data:

hourly_data = pl.dataframe({
    "time": ['2023-01-01 14:00','2023-01-02 09:00', '2023-01-04 11:00']
}).lazy()

future_periods = pl.dataframe({
    "id": [1,2,3,4],
    "start": ['2023-01-01 11:00','2023-01-02 10:00', '2023-01-03 15:00', '2023-01-04 10:00'],
    "end": ['2023-01-01 16:00','2023-01-02 11:00', '2023-01-03 18:00', '2023-01-04 15:00']
}).lazy()

┌──────────────────┬──────┐
│ time             ┆ data │
│ ---              ┆ ---  │
│ str              ┆ str  │
╞══════════════════╪══════╡
│ 2023-01-01 14:00 ┆ a    │
│ 2023-01-02 09:00 ┆ b    │
│ 2023-01-04 11:00 ┆ c    │
└──────────────────┴──────┘ 
┌─────┬──────────────────┬──────────────────┐
│ id  ┆ start            ┆ end              │
│ --- ┆ ---              ┆ ---              │
│ i64 ┆ str              ┆ str              │
╞═════╪══════════════════╪══════════════════╡
│ 1   ┆ 2023-01-01 11:00 ┆ 2023-01-01 16:00 │
│ 2   ┆ 2023-01-02 10:00 ┆ 2023-01-02 11:00 │
│ 3   ┆ 2023-01-03 15:00 ┆ 2023-01-03 18:00 │
│ 4   ┆ 2023-01-04 10:00 ┆ 2023-01-04 15:00 │
└─────┴──────────────────┴──────────────────┘

Now you can do it in two steps - first, calculate the link between time and the future period id:

time_periods = (
   hourly_data
       .join(future_periods, how="cross")
       .filter(
           pl.col("time") > pl.col("start"),
           pl.col("time") < pl.col("end")
        ).select(["time","id"])
)

┌──────────────────┬─────┐
│ time             ┆ id  │
│ ---              ┆ --- │
│ str              ┆ i64 │
╞══════════════════╪═════╡
│ 2023-01-01 14:00 ┆ 1   │
│ 2023-01-04 11:00 ┆ 4   │
└──────────────────┴─────┘

You can then join it with the original data frame:

hourly_data.join(time_periods, how="left", on="time").collect()

┌──────────────────┬──────┬──────┐
│ time             ┆ data ┆ id   │
│ ---              ┆ ---  ┆ ---  │
│ str              ┆ str  ┆ i64  │
╞══════════════════╪══════╪══════╡
│ 2023-01-01 14:00 ┆ a    ┆ 1    │
│ 2023-01-02 09:00 ┆ b    ┆ null │
│ 2023-01-04 11:00 ┆ c    ┆ 4    │
└──────────────────┴──────┴──────┘

Another way to do this might be to use duckdb Thanks for integrating with polars:

import duckdb
import polars as pl

duckdb.sql("""
    select
        h.time, h.data, p.id
    from hourly_data as h
        left join future_periods as p on
            p.start < h.time and
            p.end > h.time
""").pl()

┌──────────────────┬──────┬──────┐
│ time             ┆ data ┆ id   │
│ ---              ┆ ---  ┆ ---  │
│ str              ┆ str  ┆ i64  │
╞══════════════════╪══════╪══════╡
│ 2023-01-01 14:00 ┆ A    ┆ 1    │
│ 2023-01-04 11:00 ┆ C    ┆ 4    │
│ 2023-01-02 09:00 ┆ B    ┆ null │
└──────────────────┴──────┴──────┘

The above is the detailed content of How to create columns based on other DataFrame row filters?. For more information, please follow other related articles on the PHP Chinese website!

Statement:
This article is reproduced at:stackoverflow.com. If there is any infringement, please contact admin@php.cn delete