[rpms/python-hdmf] rawhide: Backport support for Pandas 3; fixes RHBZ#2481162

From: Benjamin A. Beasley <code@musicinmybrain.net>
To: git-commits@fedoraproject.org
Subject: [rpms/python-hdmf] rawhide: Backport support for Pandas 3; fixes RHBZ#2481162
Date: Thu, 25 Jun 2026 10:30:18 GMT
Message-ID: <178238341899.1.5316945448875546916.rpms-python-hdmf-8b53fc51bb70@fedoraproject.org>

A new commit has been pushed.

Repo   : rpms/python-hdmf
Branch : rawhide
Commit : 8b53fc51bb7013d8173218a6a45efbc2fdcf07a5
Author : Benjamin A. Beasley <code@musicinmybrain.net>
Date   : 2026-06-25T11:04:57+01:00
Stats  : +261/-1 in 2 file(s)
URL    : https://src.fedoraproject.org/rpms/python-hdmf/c/8b53fc51bb7013d8173218a6a45efbc2fdcf07a5?branch=rawhide

Log:
Backport support for Pandas 3; fixes RHBZ#2481162

---
diff --git a/0001-Accept-pandas-Series-ExtensionArray-for-Data-lift-pa.patch b/0001-Accept-pandas-Series-ExtensionArray-for-Data-lift-pa.patch
new file mode 100644
index 0000000..0555d77
--- /dev/null
+++ b/0001-Accept-pandas-Series-ExtensionArray-for-Data-lift-pa.patch
@@ -0,0 +1,254 @@
+From 51efa8c9f0e56721c29cc64aacd1ee8c74e92876 Mon Sep 17 00:00:00 2001
+From: Ryan Ly <310197+rly@users.noreply.github.com>
+Date: Wed, 24 Jun 2026 19:13:28 -0700
+Subject: [PATCH] Accept pandas Series/ExtensionArray for Data; lift pandas<3
+ cap (#1469)
+
+Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
+---
+ docs/source/conf.py                 |  1 +
+ pyproject.toml                      |  2 +-
+ src/hdmf/container.py               |  4 +-
+ src/hdmf/utils.py                   | 45 +++++++++++++-
+ tests/unit/utils_test/test_utils.py | 94 ++++++++++++++++++++++++++++-
+ 5 files changed, 142 insertions(+), 4 deletions(-)
+
+diff --git a/docs/source/conf.py b/docs/source/conf.py
+index 0ed71852..3b2b18fa 100644
+--- a/docs/source/conf.py
++++ b/docs/source/conf.py
+@@ -90,6 +90,7 @@ nitpick_ignore = [('py:class', 'Intracomm'),
+                   ('py:class', 'h5py._hl.dataset.Dataset'),
+                   ('py:class', 'function'),
+                   ('py:class', 'unittest.case.TestCase'),
++                  ('py:class', 'pandas.ExtensionArray'),
+                   ]
+ 
+ suppress_warnings = ["config.cache"]
+diff --git a/pyproject.toml b/pyproject.toml
+index 13b7aaf9..15b89cf2 100644
+--- a/pyproject.toml
++++ b/pyproject.toml
+@@ -35,7 +35,7 @@ dependencies = [
+     "h5py>=3.1.0",
+     "jsonschema>=3.2.0",
+     'numpy>=1.19.3',
+-    "pandas>=1.2.0,<3",
++    "pandas>=1.2.0",
+     "ruamel.yaml>=0.16",
+ ]
+ dynamic = ["version"]
+diff --git a/src/hdmf/container.py b/src/hdmf/container.py
+index 7e6334f5..800668d1 100644
+--- a/src/hdmf/container.py
++++ b/src/hdmf/container.py
+@@ -12,7 +12,7 @@ import pandas as pd
+ 
+ from .data_utils import DataIO, append_data, extend_data, AbstractDataChunkIterator
+ from .utils import (docval, get_docval, getargs, ExtenderMeta, get_data_shape, popargs, LabelledDict,
+-                    get_basic_array_info, generate_array_html_repr)
++                    get_basic_array_info, generate_array_html_repr, coerce_pandas_data)
+ 
+ from .term_set import TermSet, TermSetWrapper
+ 
+@@ -927,6 +927,7 @@ class Data(AbstractContainer):
+         data = popargs('data', kwargs)
+         super().__init__(**kwargs)
+ 
++        data = coerce_pandas_data(data)
+         self._validate_new_data(data)
+         self.__data = data
+ 
+@@ -1020,6 +1021,7 @@ class Data(AbstractContainer):
+ 
+         :param arg: The iterable to add to the end of this VectorData
+         """
++        arg = coerce_pandas_data(arg)
+         self._validate_new_data(arg)
+         self.__data = extend_data(self.__data, arg)
+ 
+diff --git a/src/hdmf/utils.py b/src/hdmf/utils.py
+index c7fe2b47..62969595 100644
+--- a/src/hdmf/utils.py
++++ b/src/hdmf/utils.py
+@@ -8,10 +8,12 @@ from enum import Enum
+ 
+ import h5py
+ import numpy as np
++import pandas as pd
++from pandas.api.extensions import ExtensionArray as _PandasExtensionArray
+ 
+ 
+ __macros = {
+-    'array_data': [np.ndarray, list, tuple, h5py.Dataset],
++    'array_data': [np.ndarray, list, tuple, h5py.Dataset, pd.Series, _PandasExtensionArray],
+     'scalar_data': [str, int, float, bytes, bool],
+     'data': []
+ }
+@@ -26,6 +28,47 @@ except ImportError:
+ def is_zarr_array(value):
+     return ZARR_INSTALLED and isinstance(value, ZarrArray)
+ 
++
++def coerce_pandas_data(data):
++    """Convert a pandas Series or ExtensionArray to a numpy array for HDMF storage.
++
++    HDMF stores dataset values as numpy arrays (or array-likes such as h5py.Dataset).
++    Pandas Series and ExtensionArray inputs are normalized at the construction
++    boundary so that downstream code only has to handle numpy/list/tuple data.
++
++    Raises:
++        TypeError: if the input contains missing values (pd.NA / np.nan), which
++            cannot be serialized to HDF5 variable-length string datasets and which
++            HDMF does not support for other dtypes.
++    """
++    if isinstance(data, pd.Series):
++        underlying = data.array
++    elif isinstance(data, _PandasExtensionArray):
++        underlying = data
++    else:
++        return data
++
++    if pd.isna(underlying).any():
++        raise TypeError(
++            "Cannot construct an HDMF dataset from pandas data containing missing "
++            "values (pd.NA or NaN). HDF5 cannot serialize missing values in "
++            "variable-length string datasets, and HDMF does not yet support "
++            "missing values for other dtypes. Replace missing values with a "
++            "sentinel (e.g., empty string) before passing the data to HDMF."
++        )
++
++    # pandas nullable masked dtypes (e.g. Int64, boolean, Float64) expose the
++    # backing numpy dtype. Convert through it so the result keeps that dtype on
++    # all supported pandas versions; a plain to_numpy()/np.asarray() returns an
++    # object array on pandas < 2.2.
++    numpy_dtype = getattr(underlying.dtype, "numpy_dtype", None)
++    if numpy_dtype is not None:
++        return underlying.to_numpy(dtype=numpy_dtype)
++
++    if isinstance(data, pd.Series):
++        return data.to_numpy()
++    return np.asarray(data)
++
+ if ZARR_INSTALLED:
+     # optionally accept zarr.Array as array data to support conversion of data from Zarr to HDMF
+     __macros['array_data'].append(ZarrArray)
+diff --git a/tests/unit/utils_test/test_utils.py b/tests/unit/utils_test/test_utils.py
+index 3b8fb101..96b704c7 100644
+--- a/tests/unit/utils_test/test_utils.py
++++ b/tests/unit/utils_test/test_utils.py
+@@ -2,10 +2,11 @@ import os
+ 
+ import h5py
+ import numpy as np
++import pandas as pd
+ from hdmf.container import Data
+ from hdmf.data_utils import DataChunkIterator, DataIO
+ from hdmf.testing import TestCase
+-from hdmf.utils import get_data_shape, to_uint_array, is_newer_version
++from hdmf.utils import get_data_shape, to_uint_array, is_newer_version, coerce_pandas_data
+ 
+ 
+ class TestGetDataShape(TestCase):
+@@ -221,6 +222,97 @@ class TestToUintArray(TestCase):
+         with self.assertRaisesWith(ValueError, 'Cannot convert array of dtype float64 to uint.'):
+             to_uint_array(arr)
+ 
++class TestCoercePandasData(TestCase):
++    """Tests for coerce_pandas_data, which normalizes pandas Series/ExtensionArray to numpy."""
++
++    def test_passthrough_non_pandas(self):
++        arr = np.array([1, 2, 3])
++        self.assertIs(coerce_pandas_data(arr), arr)
++        lst = [1, 2, 3]
++        self.assertIs(coerce_pandas_data(lst), lst)
++
++    def test_string_array(self):
++        sa = pd.array(['a', 'b', 'c'], dtype='string')
++        out = coerce_pandas_data(sa)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(list(out), ['a', 'b', 'c'])
++
++    def test_arrow_string_array(self):
++        try:
++            asa = pd.array(['a', 'b', 'c'], dtype='string[pyarrow]')
++        except ImportError:
++            self.skipTest('pyarrow not installed')
++        out = coerce_pandas_data(asa)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(list(out), ['a', 'b', 'c'])
++
++    def test_series_string(self):
++        s = pd.Series(['a', 'b', 'c'], dtype='string')
++        out = coerce_pandas_data(s)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(list(out), ['a', 'b', 'c'])
++
++    def test_series_numeric_lossless(self):
++        s = pd.Series([1, 2, 3])
++        out = coerce_pandas_data(s)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(out.dtype, np.int64)
++        np.testing.assert_array_equal(out, [1, 2, 3])
++
++    def test_categorical(self):
++        cat = pd.Categorical(['x', 'y', 'x'])
++        out = coerce_pandas_data(cat)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(list(out), ['x', 'y', 'x'])
++
++    def test_string_array_with_na_raises(self):
++        sa = pd.array(['a', None, 'c'], dtype='string')
++        with self.assertRaisesRegex(TypeError, 'missing values'):
++            coerce_pandas_data(sa)
++
++    def test_series_object_with_nan_raises(self):
++        s = pd.Series(['a', np.nan, 'c'])
++        with self.assertRaisesRegex(TypeError, 'missing values'):
++            coerce_pandas_data(s)
++
++    def test_integer_array_lossless(self):
++        ia = pd.array([1, 2, 3], dtype='Int64')
++        out = coerce_pandas_data(ia)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(out.dtype, np.int64)
++        np.testing.assert_array_equal(out, [1, 2, 3])
++
++    def test_boolean_array_lossless(self):
++        ba = pd.array([True, False, True], dtype='boolean')
++        out = coerce_pandas_data(ba)
++        self.assertIsInstance(out, np.ndarray)
++        self.assertEqual(out.dtype, np.bool_)
++        np.testing.assert_array_equal(out, [True, False, True])
++
++    def test_integer_array_with_na_raises(self):
++        ia = pd.array([1, None, 3], dtype='Int64')
++        with self.assertRaisesRegex(TypeError, 'missing values'):
++            coerce_pandas_data(ia)
++
++
++class TestDataAcceptsPandas(TestCase):
++    """Verify pandas Series/ExtensionArray flow through Data construction."""
++
++    def test_vector_data_from_arrow_string_values(self):
++        from hdmf.common import VectorData
++        df = pd.DataFrame({'animal': ['cat', 'dog', 'bird']})
++        vd = VectorData(name='animal', description='', data=df['animal'].values)
++        self.assertIsInstance(vd.data, np.ndarray)
++        self.assertEqual(list(vd.data), ['cat', 'dog', 'bird'])
++
++    def test_vector_data_from_series(self):
++        from hdmf.common import VectorData
++        s = pd.Series(['a', 'b', 'c'])
++        vd = VectorData(name='s', description='', data=s)
++        self.assertIsInstance(vd.data, np.ndarray)
++        self.assertEqual(list(vd.data), ['a', 'b', 'c'])
++
++
+ class TestVersionComparison(TestCase):
+     """Test the version comparison functionality in NamespaceCatalog."""
+ 
+-- 
+2.54.0
+

diff --git a/python-hdmf.spec b/python-hdmf.spec
index 2fc2e0e..b55d465 100644
--- a/python-hdmf.spec
+++ b/python-hdmf.spec
@@ -48,6 +48,10 @@ URL:            %forgeurl
 Source0:        %forgesource
 # Man page hand-written for Fedora in groff_man(7) format based on help output
 Source1:        validate_hdmf_spec.1
+# Accept pandas Series/ExtensionArray for Data; lift pandas<3 cap (#1469)
+# https://github.com/hdmf-dev/hdmf/commit/744cf1971f92f34673c41b55376952f8ffe4707f
+# Backported to 4.3.1, without modifications to CHANGELOG.md
+Patch:          0001-Accept-pandas-Series-ExtensionArray-for-Data-lift-pa.patch
 
 BuildArch:      noarch
 
@@ -83,7 +87,9 @@ Obsoletes:      python3-hdmf+zarr < 4.1.0-2
 rm -vrf src/hdmf/common/hdmf-common-schema/
 # Upstream pins numcodecs because “numcodecs 0.16.0 is not compatible with
 # zarr<3,” but we cannot respect this.
-sed -r -i 's/("numcodecs)<[^"]+"/\1"/' pyproject.toml
+%pyproject_patch_dependency numcodecs:drop_upper
+# Allow pandas 3
+%pyproject_patch_dependency pandas:set_upper:4.0
 
 %generate_buildrequires
 %pyproject_buildrequires -x tqdm%{?with_zarr:,zarr},sparse%{?with_termset:,termset}
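
Note on the removed sed line in the spec hunk above: it stripped an upper bound that directly follows the quoted name "numcodecs" in pyproject.toml. A rough Python equivalent of that regular expression, using only the standard library, might look like the sketch below. The function name and the sample requirement strings are hypothetical and are not taken from the upstream pyproject.toml.

```python
import re

# Same pattern as the removed sed expression: the opening quote plus the name
# is captured, then a "<" bound running up to the closing quote is dropped.
_NUMCODECS_UPPER = re.compile(r'("numcodecs)<[^"]+"')


def drop_numcodecs_upper(line):
    """Return the line with a '<...' bound on "numcodecs" removed, if present."""
    return _NUMCODECS_UPPER.sub(r'\1"', line)


# Hypothetical sample lines, for illustration only.
print(drop_numcodecs_upper('    "numcodecs<0.16.0",'))  # prints:     "numcodecs",
print(drop_numcodecs_upper('    "pandas>=1.2.0",'))     # line is left unchanged
```

Like the sed expression, this only matches when the "<" bound comes immediately after the package name.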

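Note on the TypeError raised by coerce_pandas_data in the patch above: its message suggests replacing missing values with a sentinel before handing data to HDMF. A minimal caller-side sketch of that step follows, using pandas and numpy only. The helper name fill_missing and the sentinel values ("" and -1) are assumptions chosen for illustration; they are not part of HDMF or of this patch.

```python
import numpy as np
import pandas as pd


def fill_missing(values, sentinel):
    """Return a copy of a pandas Series or ExtensionArray with missing entries
    replaced by `sentinel` (hypothetical helper, not an HDMF API)."""
    filled = values.fillna(sentinel)
    # After filling, nothing should trip the missing-value check in the patch.
    if pd.isna(filled).any():
        raise ValueError("sentinel did not remove all missing values")
    return filled


# Nullable string Series with one missing entry; "" is an assumed sentinel.
names = fill_missing(pd.Series(["cat", None, "bird"], dtype="string"), "")

# Nullable Int64 array with one missing entry; -1 is an assumed sentinel.
counts = fill_missing(pd.array([1, None, 3], dtype="Int64"), -1)

# With no missing values left, conversion to numpy is straightforward.
names_np = names.to_numpy()
counts_np = counts.to_numpy(dtype="int64")
```

Whether a sentinel is acceptable depends on the data: it has to be a value that cannot be confused with a real entry.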