Reproducible Example
import pandas as pd
import numpy as np
index=pd.date_range('2019-12-31T00:10:00', '2020-01-31T00:10:00', freq='1T')
df = pd.DataFrame(np.zeros(len(index)), index=index)
grouper = pd.Grouper(freq='2D', closed='right')
sampled = df.groupby(grouper).sum()
Issue Description
The groupby() call creates a new Resampler object, which builds the bins by calling the _get_time_bins() function.
This function determines the first and last bin edges from the origin and the closed semantics, then calls the date_range() helper to generate the bin edges. In our example they are the following:
DatetimeIndex(['2019-12-31', '2020-01-02', '2020-01-04', '2020-01-06',
               '2020-01-08', '2020-01-10', '2020-01-12', '2020-01-14',
               '2020-01-16', '2020-01-18', '2020-01-20', '2020-01-22',
               '2020-01-24', '2020-01-26', '2020-01-28', '2020-01-30',
               '2020-02-01'],
              dtype='datetime64[ns]', freq='2D')
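The edges listed above can be reproduced with a minimal standard-library sketch. This is not how pandas computes them internally: `sketch_bin_edges` is a hypothetical helper that assumes the bins are anchored at midnight of the first day and extended in 2-day steps until the last value is covered. It ignores the case where the first value falls exactly on an edge.

```python
from datetime import datetime, timedelta


def sketch_bin_edges(first, last, step):
    """Hypothetical helper: bin edges covering the range first..last."""
    # Assumption: anchor the first edge at midnight of the first day.
    start = datetime(first.year, first.month, first.day)
    edges = [start]
    # Keep adding edges until the last value is covered.
    while edges[-1] < last:
        edges.append(edges[-1] + step)
    return edges


edges = sketch_bin_edges(
    first=datetime(2019, 12, 31, 0, 10),
    last=datetime(2020, 1, 31, 0, 10),
    step=timedelta(days=2),
)
print(len(edges), edges[0], edges[-1])
```

The sketch yields 17 edges from 2019-12-31 to 2020-02-01, matching the DatetimeIndex in the example.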
At this point the function calls self._adjust_bin_edges(binner, ax_values), passing the bins and the actual source values:
def _adjust_bin_edges(self, binner, ax_values):
    # Some hacks for > daily data, see #1471, #1458, #1483
    if self.freq != "D" and is_superperiod(self.freq, "D"):
        if self.closed == "right":
            # GH 21459, GH 9119: Adjust the bins relative to the wall time
            bin_edges = binner.tz_localize(None)
            bin_edges = bin_edges + timedelta(1) - Nano(1)
            bin_edges = bin_edges.tz_localize(binner.tz).asi8
        else:
            bin_edges = binner.asi8
        # intraday values on last day
        if bin_edges[-2] > ax_values.max():
            bin_edges = bin_edges[:-1]
            binner = binner[:-1]
    else:
        bin_edges = binner.asi8
    return binner, bin_edges
This function seems to contain old workaround code that tries to fix earlier issues.
If the frequency is not exactly one day (as with freq='2D' in our example) and is_superperiod() returns True, the code enters the first branch. The is_superperiod() check itself seems broken, since it appears to look only at the base (rule) frequency and ignores the multiplier: it cannot detect that 2D is not a superperiod of D, or that 48H is semantically the same as 2D. With closed='right', this branch adds one day minus one nanosecond to every bin edge, turning the edges into something like this:
DatetimeIndex(['2019-12-31 23:59:59.999999999',
               '2020-01-02 23:59:59.999999999',
               '2020-01-04 23:59:59.999999999',
               ...
               ...
               '2020-02-01 23:59:59.999999999'],
              dtype='datetime64[ns]', freq='2D')
This breaks the next step. The code later calls the generic Cython function lib.generate_bins_dt64(), which compares the actual values, starting at '2019-12-31T00:10:00', against the first bin edge, which is now the adjusted value '2019-12-31 23:59:59.999999999'. Because the first bin edge now lies after the first value, the first sanity check fails and raises:
@cython.boundscheck(False)
@cython.wraparound(False)
def generate_bins_dt64(ndarray[int64_t] values, const int64_t[:] binner,
                       object closed='left', bint hasnans=False):
    ...
    ...
    # check binner fits data
    if values[0] < binner[0]:
        raise ValueError("Values falls before first bin")
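The failure can be illustrated without pandas using plain integer nanoseconds. This is a sketch under stated assumptions, not the library code: `to_ns` and `check_binner_fits` are hypothetical helpers, and only the comparison and the error message are taken from the snippet quoted above.

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)
NS_PER_DAY = 86_400 * 10**9


def to_ns(ts):
    """Hypothetical helper: naive datetime to integer nanoseconds since epoch."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1_000


def check_binner_fits(values, binner):
    """Hypothetical stand-in for the sanity check quoted above."""
    if values[0] < binner[0]:
        raise ValueError("Values falls before first bin")


first_value = to_ns(datetime(2019, 12, 31, 0, 10))  # first row of the example
first_edge = to_ns(datetime(2019, 12, 31))          # first bin edge from date_range()
adjusted_edge = first_edge + NS_PER_DAY - 1         # + timedelta(1) - Nano(1)

# The unadjusted edge passes the check.
check_binner_fits([first_value], [first_edge])

# The adjusted edge lies after the first value, so the check raises.
try:
    check_binner_fits([first_value], [adjusted_edge])
    raised = False
except ValueError:
    raised = True
print(raised)
```

The adjusted edge corresponds to 2019-12-31 23:59:59.999999999, which is later than the first value at 00:10:00, so the comparison `values[0] < binner[0]` is true.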
Expected Behavior
The groupby() call should return the grouped result without raising any exception.
Installed Versions
pd.show_versions()
INSTALLED VERSIONS
commit : 73c6825
python : 3.8.8.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-140-generic
Version : #144~16.04.1-Ubuntu SMP Fri Mar 19 21:24:12 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 54.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.