Pandas Extension Types

Text Extensions for Pandas includes extension types for representing spans and tensors inside Pandas DataFrames. This section describes the Python classes that implement these types.

Span Extension Type

The SpanDtype extension data type efficiently stores span data in a Pandas Series. Each span is represented by begin and end character offsets into a target document. We use dense NumPy arrays for efficient internal storage.

class text_extensions_for_pandas.SpanDtype[source]

Pandas datatype for a span that represents a range of characters within a target string.

SpanArray Class: Store spans in a Pandas Series

class text_extensions_for_pandas.SpanArray(text: Union[str, Sequence[str], numpy.ndarray, Tuple[text_extensions_for_pandas.array.string_table.StringTable, numpy.ndarray]], begins: Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]], ends: Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]])[source]

A Pandas ExtensionArray that represents a column of character-based spans over a single target text.

Spans are represented as [begin, end) intervals, where begin and end are character offsets into the target text.
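
To make the half-open [begin, end) convention concrete, the sketch below uses plain Python lists and string slicing. It is an illustration of the offset semantics only, not the SpanArray implementation; the variable names are chosen for this example.

```python
# Illustrative sketch only: plain Python, not the SpanArray implementation.
text = "Hello world"
begins = [0, 6]   # inclusive character offsets
ends = [5, 11]    # exclusive character offsets

# With half-open intervals, the covered text is an ordinary Python slice
# and the length of a span is simply end - begin.
covered = [text[b:e] for b, e in zip(begins, ends)]
lengths = [e - b for b, e in zip(begins, ends)]

print(covered)
print(lengths)
```

Because end is exclusive, adjacent spans such as [0, 5) and [5, 6) share no characters.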

Public Data Attributes:

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

target_text

"document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

document_text

if all spans in this array cover the same document, text of that document.

is_single_document

True if there is at least one span in the array and every span is over the same target text.

begin

end

version

Monotonically increasing version number that changes every time this array is modified.

covered_text

an array of the substrings of target_text corresponding to the spans in this array.

normalized_covered_text

A normalized version of the covered text of the spans in this array.

repr_html_show_offsets

Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.

Inherited from ExtensionArray

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

shape

Return a tuple of the array dimensions.

size

The number of elements in the array.

ndim

Extension Arrays are only allowed to be 1-dimensional.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

T

Public Methods:

__init__(text, begins, ends)

Factory method for creating instances of this class.

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__len__()

Length of this array

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__eq__(other)

Pandas/Numpy-style array/series comparison function.

__ne__(other)

Return self != other (element-wise inequality).

__hash__()

Return hash(self).

__contains__(item)

Return true if scalar item exists in this SpanArray.

equals(other)

Return True if other, a second SpanArray, has the same target texts and the same spans in the same order as this array.

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__lt__(other)

Pandas-style array/series comparison function.

__gt__(other)

Return self>value.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.

make_array(o)

Make a SpanArray object out of any of several types of input.

split_by_document()

Return a list of slices of this SpanArray that cover single documents.

increment_version()

Manually increase the version counter of this array to indicate that the array's contents have changed.

as_tuples()

Return (begin, end) pairs as an array of tuples.

as_frame()

Returns a dataframe representation of this column based on Python atomic types.

same_target_text(other)

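Return a boolean mask of the entries that have the same target text as other, which is either a single span or an array of spans of the same length as this one.

overlaps(other)

Return a boolean mask of the entries that overlap the corresponding element of other, which is either a single span or an array of spans of the same length as this one.

contains(other)

Return a boolean mask of the entries that contain the corresponding element of other, which is either a single span or an array of spans of the same length as this one.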

__arrow_array__([type])

Conversion of this Array to a pyarrow.ExtensionArray.

Inherited from ExtensionArray

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__len__()

Length of this array

__iter__()

Iterate over elements of the array.

__contains__(item)

Return true if scalar item exists in this SpanArray.

__eq__(other)

Pandas/Numpy-style array/series comparison function.

__ne__(other)

Return self != other (element-wise inequality).

to_numpy([dtype, copy, na_value])

Convert to a NumPy ndarray.

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

argsort([ascending, kind, na_position])

Return the indices that would sort this array.

argmin([skipna])

Return the index of minimum value.

argmax([skipna])

Return the index of maximum value.

fillna([value, method, limit])

Fill NA/NaN values using the specified method.

dropna()

Return ExtensionArray without NA values.

shift([periods, fill_value])

Shift values by desired number.

unique()

Compute the ExtensionArray of unique values.

searchsorted(value[, side, sorter])

Find indices where elements should be inserted to maintain order.

equals(other)

Return True if other, a second SpanArray, has the same target texts and the same spans in the same order as this array.

isin(values)

Pointwise comparison for set containment in the given values.

factorize([na_sentinel])

Encode the extension array as an enumerated type.

repeat(repeats[, axis])

Repeat elements of a ExtensionArray.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

view([dtype])

Return a view on the array.

__repr__()

Return repr(self).

transpose(*axes)

Return a transposed view on this array.

ravel([order])

Return a flattened view on this array.

tolist()

Return a list of the values.

delete(loc)

insert(loc, item)

Insert an item at the given position.

__array_ufunc__(ufunc, method, *inputs, **kwargs)

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.


as_frame() pandas.core.frame.DataFrame[source]

Returns a dataframe representation of this column based on Python atomic types.

as_tuples() numpy.ndarray[source]
Returns

(begin, end) pairs as an array of tuples

astype(dtype, copy=True)[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

contains(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]
Parameters

other – Either a single span or an array of spans of the same length as this one

Returns

Numpy array containing a boolean mask of all entries that contain the corresponding element of other

copy() text_extensions_for_pandas.array.span.SpanArray[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property covered_text: numpy.ndarray

an array of the substrings of target_text corresponding to the spans in this array.

property document_text: Optional[str]

if all spans in this array cover the same document, text of that document. Raises a ValueError if the array is empty or if the Spans in this array cover more than one document.

property dtype: pandas.core.dtypes.base.ExtensionDtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

equals(other: text_extensions_for_pandas.array.span.SpanArray)[source]
Parameters

other – A second SpanArray

Returns

True if both arrays have the same target texts (can be a different string object with the same contents) and the same spans in the same order.

increment_version()[source]

Manually increase the version counter of this array to indicate that the array’s contents have changed. Also invalidates any internal cached data derived from the array’s state.

property is_single_document: bool

True if there is at least one span in the array and every span is over the same target text.

isna() numpy.array[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

classmethod make_array(o) text_extensions_for_pandas.array.span.SpanArray[source]

Make a SpanArray object out of any of several types of input.

Parameters

o – a SpanArray object represented as a pd.Series, a list of Span objects, or maybe just an actual SpanArray (or TokenSpanArray) object.

Returns

SpanArray version of o, which may be a pointer to o or one of its fields.

property nbytes: int

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property normalized_covered_text: numpy.ndarray

A normalized version of the covered text of the spans in this array. Currently “normalized” means “lowercase”.

overlaps(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]
Parameters

other – Either a single span or an array of spans of the same length as this one

Returns

Numpy array containing a boolean mask of all entries that overlap the corresponding element of other
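
As a rough illustration of the element-wise behavior described above, the sketch below computes an overlap mask over plain Python lists. The helper name overlaps_mask and the tuple representation of a span are assumptions made for this example; the real method operates on SpanArray and Span objects and returns a NumPy array, and its handling of zero-length spans at interval boundaries may differ.

```python
# Hypothetical helper for illustration; not part of text_extensions_for_pandas.
def overlaps_mask(begins, ends, other):
    """Return a list of booleans, one per span in (begins, ends).

    `other` is either a single (begin, end) tuple, which is compared
    against every span, or a list of such tuples with the same length.
    """
    if isinstance(other, tuple):
        others = [other] * len(begins)   # broadcast the single span
    else:
        others = list(other)
        if len(others) != len(begins):
            raise ValueError("other must have the same length as this array")
    # Two half-open intervals overlap when each starts before the other ends.
    return [b < oe and ob < e
            for b, e, (ob, oe) in zip(begins, ends, others)]


begins = [0, 6, 12]
ends = [5, 11, 15]
single = overlaps_mask(begins, ends, (4, 8))
pairwise = overlaps_mask(begins, ends, [(5, 6), (0, 7), (14, 20)])
print(single, pairwise)
```

The same broadcasting pattern would apply to contains() and same_target_text(), which take the same kind of argument.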

property repr_html_show_offsets

Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.

same_target_text(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]
Parameters

other – Either a single span or an array of spans of the same length as this one

Returns

Numpy array containing a boolean mask of all entries that have the same target text. Two spans with target text of None are considered to have the same target text.

split_by_document() List[text_extensions_for_pandas.array.span.SpanArray][source]
Returns

A list of slices of this SpanArray that cover single documents.
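
One way to picture the result is as a grouping of spans by their target text. The sketch below does this over plain (text, begin, end) tuples; the function name and the choice to group in order of first appearance are assumptions for illustration, and the library may order or slice its results differently.

```python
# Hypothetical sketch: group spans by target text, one group per document.
def split_by_document_sketch(spans):
    groups = {}  # dicts preserve insertion order in Python 3.7+
    for text, begin, end in spans:
        groups.setdefault(text, []).append((text, begin, end))
    return list(groups.values())


spans = [("doc A", 0, 3), ("doc B", 0, 3), ("doc A", 4, 5)]
parts = split_by_document_sketch(spans)
for part in parts:
    # Every group covers exactly one document.
    print(part[0][0], [(b, e) for _, b, e in part])
```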

take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) text_extensions_for_pandas.array.span.SpanArray[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property target_text: numpy.ndarray

“document” texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

property version: int

Monotonically increasing version number that changes every time this array is modified. NOTE: This number might not change if a caller obtains a pointer to an internal array and modifies it. Callers who perform such modifications should call increment_version()
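
The interaction between the version counter, cached data, and direct modification of internal arrays can be sketched with a small stand-in class. VersionedOffsets and its members are hypothetical names invented for this example; they only mimic the documented behavior and are not the SpanArray internals.

```python
# Hypothetical stand-in that mimics the documented version/cache behavior.
class VersionedOffsets:
    def __init__(self, begins, ends):
        self._begins = list(begins)
        self._ends = list(ends)
        self._version = 0
        self._cached_lengths = None  # derived data, rebuilt on demand

    @property
    def version(self):
        return self._version

    def increment_version(self):
        # Bump the counter and drop anything derived from the old state.
        self._version += 1
        self._cached_lengths = None

    @property
    def lengths(self):
        if self._cached_lengths is None:
            self._cached_lengths = [e - b for b, e in zip(self._begins, self._ends)]
        return self._cached_lengths


arr = VersionedOffsets([0, 6], [5, 11])
before = list(arr.lengths)      # fills the cache
arr._begins[0] = 2              # direct edit: version and cache are NOT updated
stale = list(arr.lengths)       # still the old cached value
arr.increment_version()         # caller signals the change explicitly
fresh = list(arr.lengths)       # recomputed from the current offsets
```

The stale read in the middle is the situation the note above warns about: a caller who changes an internal array directly is responsible for calling increment_version() afterwards.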

Span Class: Object to represent a single span

class text_extensions_for_pandas.Span(text: str, begin: int, end: int)[source]

Python object representation of a single span with character offsets; that is, a single row of a SpanArray.

An offset of Span.NULL_OFFSET_VALUE (currently -1) indicates “not a span” in the sense that NaN is “not a number”.
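
The sentinel convention can be illustrated with plain lists. In the sketch below, the constant mirrors the documented value of -1, and the remaining names are chosen for this example only.

```python
# Illustrative sketch of the "not a span" sentinel; not library code.
NULL_OFFSET_VALUE = -1  # mirrors the documented value of Span.NULL_OFFSET_VALUE

text = "Hello world"
begins = [0, NULL_OFFSET_VALUE, 6]
ends = [5, NULL_OFFSET_VALUE, 11]

# A null span is recognized by its sentinel offset, not by its covered text.
nulls_mask = [b == NULL_OFFSET_VALUE for b in begins]
covered = [None if is_null else text[b:e]
           for b, e, is_null in zip(begins, ends, nulls_mask)]
print(nulls_mask, covered)
```

The explicit check matters because slicing a Python string with the sentinel, as in text[-1:-1], silently yields an empty string rather than an error.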

Most of the methods and properties of this class are single-span versions of the eponymous methods in SpanArray. See that class for API documentation.

Public Data Attributes:

NULL_OFFSET_VALUE

begin

end

target_text

covered_text

Returns the substring of self.target_text that this Span represents.

Public Methods:

__init__(text, begin, end)

Create a span over text, the target document text on which the span is defined.

__repr__()

Return repr(self).

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

__lt__(other)

span1 < span2 if span1.end <= span2.begin and both spans are over the same target text

__gt__(other)

Return self>value.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.

overlaps(other)

Return True if this span overlaps other, which is another Span or TokenSpan.

contains(other)

Return True if other, another Span or TokenSpan, is entirely within the bounds of this span.

context([num_chars])

Show the location of this span in the context of the target string.

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.


contains(other: text_extensions_for_pandas.array.span.Span)[source]
Parameters

other – Another Span or TokenSpan

Returns

True if other is entirely within the bounds of this span. Also True if a zero-length span is contained within the other.
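
The containment test on half-open intervals reduces to two comparisons. The sketch below works on bare (begin, end) tuples and assumes both spans are over the same target text; the function name is hypothetical and the library's own method takes Span objects.

```python
# Hypothetical sketch of containment for half-open [begin, end) intervals.
def contains_sketch(outer, inner):
    outer_begin, outer_end = outer
    inner_begin, inner_end = inner
    return outer_begin <= inner_begin and inner_end <= outer_end


inside = contains_sketch((0, 10), (2, 5))       # fully inside
straddles = contains_sketch((0, 10), (8, 12))   # runs past the end
zero_length = contains_sketch((0, 10), (4, 4))  # zero-length span inside
print(inside, straddles, zero_length)
```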

context(num_chars: int = 40) str[source]

Show the location of this span in the context of the target string.

Parameters

num_chars – How many characters on either side to display

Returns

A string in the form: `<text before>[<text inside>]<text after>` describing the text within and around the span.
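
A minimal sketch of how such a string could be assembled is shown below. It follows the documented `<text before>[<text inside>]<text after>` form but is not the library's code; the function name is hypothetical and the real method may add details, such as truncation markers, that are not shown here.

```python
# Hypothetical sketch of building a context string around a span.
def context_sketch(text, begin, end, num_chars=40):
    before = text[max(0, begin - num_chars):begin]  # clamp at start of text
    inside = text[begin:end]
    after = text[end:end + num_chars]               # slicing clamps at the end
    return f"{before}[{inside}]{after}"


shown = context_sketch("The quick brown fox", 4, 9, num_chars=4)
print(shown)  # The [quick] bro
```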

property covered_text

Returns the substring of self.target_text that this Span represents.

overlaps(other: text_extensions_for_pandas.array.span.Span)[source]
Parameters

other – Another Span or TokenSpan

Returns

True if the two spans overlap. Also True if a zero-length span is contained within the other.

Token-Based Span Extension Type

The TokenSpanDtype extension data type is similar to SpanDtype, except that it represents spans using begin and end offsets into the tokens of a target document. These tokens are stored in a (shared) SpanArray object.

class text_extensions_for_pandas.TokenSpanDtype[source]

Pandas datatype for a span that represents a range of tokens within a target string.

TokenSpanArray Class: Store token-based spans in a Pandas Series

class text_extensions_for_pandas.TokenSpanArray(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, Sequence[text_extensions_for_pandas.array.span.SpanArray]], begin_tokens: Optional[Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]]] = None, end_tokens: Optional[Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]]] = None)[source]

A Pandas ExtensionArray that represents a column of token-based spans over a single target text.

Spans are represented internally as [begin_token, end_token) intervals, where the properties begin_token and end_token are token offsets into the target text. As with the parent class SpanArray, the properties begin and end of a TokenSpanArray return character offsets.

Null values are encoded with begin and end offsets of TokenSpan.NULL_OFFSET_VALUE.

Fields:

  • self._tokens: Reference to the target string’s tokens as a SpanArray. For now, references to different SpanArray objects are treated as different even if the arrays have the same contents.

  • self._begin_tokens: Numpy array of integer offsets in tokens. An offset of TokenSpan.NULL_OFFSET_VALUE here indicates a null value.

  • self._end_tokens: Numpy array of end offsets (1 + last token in span).
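
The relationship between the token offsets stored in these fields and the character offsets returned by begin and end can be sketched as a lookup into the token boundaries. The code below uses plain lists and a hypothetical helper name, and it assumes every non-null span covers at least one token; it is an illustration rather than the TokenSpanArray implementation.

```python
# Illustrative sketch: convert token offsets to character offsets.
NULL_OFFSET_VALUE = -1  # mirrors the documented sentinel value

text = "Spans over tokens"
token_begins = [0, 6, 11]   # character offset where each token starts
token_ends = [5, 10, 17]    # character offset where each token ends (exclusive)


def to_char_offsets(begin_tokens, end_tokens):
    char_begins, char_ends = [], []
    for bt, et in zip(begin_tokens, end_tokens):
        if bt == NULL_OFFSET_VALUE:
            # Null spans stay null at the character level.
            char_begins.append(NULL_OFFSET_VALUE)
            char_ends.append(NULL_OFFSET_VALUE)
        else:
            char_begins.append(token_begins[bt])
            # end_token is 1 + the last token, so look up token et - 1.
            char_ends.append(token_ends[et - 1])
    return char_begins, char_ends


char_begins, char_ends = to_char_offsets([0, 1, -1], [2, 3, -1])
print(char_begins, char_ends)
print(text[char_begins[0]:char_ends[0]])  # Spans over
```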

Public Data Attributes:

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

tokens

The tokens over which each TokenSpan in this array is defined, as an ndarray of objects.

target_text

"document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

document_text

if all spans in this array cover the same document, text of that document.

document_tokens

if all spans in this array cover the same tokenization of a single document, tokens of that document.

nulls_mask

A boolean mask indicating which rows are nulls

begin

the character offsets of the span begins.

end

the character offsets of the span ends.

begin_token

Token offsets of the span begins; that is, the index of the first token in each span.

end_token

Token offsets of the span ends.

covered_text

Returns an array of the substrings of target_text corresponding to the spans in this array.

is_single_document

True if every span in this array is over the same target text or if there are zero spans in this array.

is_single_tokenization

True if every span in this array is over the same tokenization of the same target text or if there are zero spans in this array.

Inherited from SpanArray

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

target_text

"document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

document_text

if all spans in this array cover the same document, text of that document.

is_single_document

True if every span in this array is over the same target text or if there are zero spans in this array.

begin

the character offsets of the span begins.

end

the character offsets of the span ends.

version

Monotonically increasing version number that changes every time this array is modified.

covered_text

Returns an array of the substrings of target_text corresponding to the spans in this array.

normalized_covered_text

A normalized version of the covered text of the spans in this array.

repr_html_show_offsets

Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.

Inherited from ExtensionArray

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

shape

Return a tuple of the array dimensions.

size

The number of elements in the array.

ndim

Extension Arrays are only allowed to be 1-dimensional.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

T

Public Methods:

__init__(tokens[, begin_tokens, end_tokens])

param tokens

Character-level span information about the underlying

from_char_offsets(tokens)

Convenience factory method for wrapping the character-level spans of a series of tokens into single-token token-based spans.

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__len__()

Length of this array

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__eq__(other)

Pandas/Numpy-style array/series comparison function.

__hash__()

Return hash(self).

__contains__(item)

Return true if scalar item exists in this TokenSpanArray.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

make_array(o)

Make a TokenSpanArray object out of any of several types of input.

align_to_tokens(tokens, spans)

Align a set of character or token-based spans to a specified tokenization, producing a TokenSpanArray of token-based spans.

as_tuples()

Returns (begin, end) pairs as an array of tuples

increment_version()

Override parent class's version of this function to also clear out data cached in the subclass.

as_frame()

Returns a dataframe representation of this column based on Python atomic types.

same_target_text(other)

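Return a boolean mask of the entries that have the same target text as other, which is either a single span or an array of spans of the same length as this one.

same_tokens(other)

Return a boolean mask of the entries that are over the same tokenization of the same target text as other, which is either a single token-based span or an array of token-based spans of the same length as this one.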

split_by_document()

Return a list of slices of this SpanArray that cover single documents.

__arrow_array__([type])

Conversion of this Array to a pyarrow.ExtensionArray.

Inherited from SpanArray

__init__(tokens[, begin_tokens, end_tokens])

param tokens

Character-level span information about the underlying

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__len__()

Length of this array

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__eq__(other)

Pandas/Numpy-style array/series comparison function.

__ne__(other)

Return self != other (element-wise inequality).

__hash__()

Return hash(self).

__contains__(item)

Return true if scalar item exists in this TokenSpanArray.

equals(other)

Return True if other, a second SpanArray, has the same target texts and the same spans in the same order as this array.

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__lt__(other)

Pandas-style array/series comparison function.

__gt__(other)

Return self>value.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.

make_array(o)

Make a TokenSpanArray object out of any of several types of input.

split_by_document()

Return a list of slices of this SpanArray that cover single documents.

increment_version()

Override parent class's version of this function to also clear out data cached in the subclass.

as_tuples()

Returns (begin, end) pairs as an array of tuples

as_frame()

Returns a dataframe representation of this column based on Python atomic types.

same_target_text(other)

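Return a boolean mask of the entries that have the same target text as other, which is either a single span or an array of spans of the same length as this one.

overlaps(other)

Return a boolean mask of the entries that overlap the corresponding element of other, which is either a single span or an array of spans of the same length as this one.

contains(other)

Return a boolean mask of the entries that contain the corresponding element of other, which is either a single span or an array of spans of the same length as this one.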

__arrow_array__([type])

Conversion of this Array to a pyarrow.ExtensionArray.

Inherited from ExtensionArray

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__len__()

Length of this array

__iter__()

Iterate over elements of the array.

__contains__(item)

Return true if scalar item exists in this TokenSpanArray.

__eq__(other)

Pandas/Numpy-style array/series comparison function.

__ne__(other)

Return self != other (element-wise inequality).

to_numpy([dtype, copy, na_value])

Convert to a NumPy ndarray.

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

argsort([ascending, kind, na_position])

Return the indices that would sort this array.

argmin([skipna])

Return the index of minimum value.

argmax([skipna])

Return the index of maximum value.

fillna([value, method, limit])

Fill NA/NaN values using the specified method.

dropna()

Return ExtensionArray without NA values.

shift([periods, fill_value])

Shift values by desired number.

unique()

Compute the ExtensionArray of unique values.

searchsorted(value[, side, sorter])

Find indices where elements should be inserted to maintain order.

equals(other)

Return True if other, a second SpanArray, has the same target texts and the same spans in the same order as this array.

isin(values)

Pointwise comparison for set containment in the given values.

factorize([na_sentinel])

Encode the extension array as an enumerated type.

repeat(repeats[, axis])

Repeat elements of a ExtensionArray.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

view([dtype])

Return a view on the array.

__repr__()

Return repr(self).

transpose(*axes)

Return a transposed view on this array.

ravel([order])

Return a flattened view on this array.

tolist()

Return a list of the values.

delete(loc)

insert(loc, item)

Insert an item at the given position.

__array_ufunc__(ufunc, method, *inputs, **kwargs)

Inherited from TokenSpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.


classmethod align_to_tokens(tokens: Any, spans: Any)[source]

Align a set of character or token-based spans to a specified tokenization, producing a TokenSpanArray of token-based spans.

Parameters
  • tokens – The tokens to align to, as any type that SpanArray.make_array() accepts.

  • spans – The spans to align. These spans must all target the same text as tokens.

Returns

An array of TokenSpan objects aligned to the tokens of tokens. Raises ValueError if any of the spans in spans doesn’t start and end on a token boundary.
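
The alignment step can be pictured as a lookup from character offsets to token boundaries. The sketch below is written against plain lists with a hypothetical function name; it shows the boundary check and the ValueError described above, not the library's actual algorithm.

```python
# Hypothetical sketch of aligning character spans to token boundaries.
def align_to_tokens_sketch(token_begins, token_ends, spans):
    # Map each character offset at which a token begins to that token's index,
    # and each offset at which a token ends to (index + 1), the exclusive end.
    begin_to_index = {b: i for i, b in enumerate(token_begins)}
    end_to_index = {e: i + 1 for i, e in enumerate(token_ends)}
    aligned = []
    for begin, end in spans:
        if begin not in begin_to_index or end not in end_to_index:
            raise ValueError(
                f"Span [{begin}, {end}) does not start and end on a token boundary")
        aligned.append((begin_to_index[begin], end_to_index[end]))
    return aligned


# Tokens of "Spans over tokens": "Spans", "over", "tokens"
token_begins = [0, 6, 11]
token_ends = [5, 10, 17]
aligned = align_to_tokens_sketch(token_begins, token_ends, [(0, 10), (11, 17)])
print(aligned)  # [(0, 2), (2, 3)]
```

A span such as (1, 10), which starts in the middle of the first token, would raise ValueError in this sketch.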

as_frame() pandas.core.frame.DataFrame[source]

Returns a dataframe representation of this column based on Python atomic types.

as_tuples() numpy.ndarray[source]

Returns (begin, end) pairs as an array of tuples

astype(dtype, copy=True)[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property begin: numpy.ndarray

the character offsets of the span begins.

property begin_token: numpy.ndarray

Token offsets of the span begins; that is, the index of the first token in each span.

copy() text_extensions_for_pandas.array.token_span.TokenSpanArray[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property covered_text: numpy.ndarray

Returns an array of the substrings of target_text corresponding to the spans in this array.

property document_text: Optional[str]

if all spans in this array cover the same document, text of that document. Raises a ValueError if the array is empty or if the Spans in this array cover more than one document.

property document_tokens: Optional[text_extensions_for_pandas.array.span.SpanArray]

if all spans in this array cover the same tokenization of a single document, tokens of that document. Raises a ValueError if the array is empty or if the Spans in this array cover more than one document.

property dtype: pandas.core.dtypes.base.ExtensionDtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property end: numpy.ndarray

the character offsets of the span ends.

property end_token: numpy.ndarray

Token offsets of the span ends. That is, 1 + last token present in the span, for each span in the column.

static from_char_offsets(tokens: Any) text_extensions_for_pandas.array.token_span.TokenSpanArray[source]

Convenience factory method for wrapping the character-level spans of a series of tokens into single-token token-based spans.

Parameters

tokens – character-based offsets of the tokens, as any type that SpanArray.make_array() understands.

Returns

A TokenSpanArray containing single-token spans for each of the tokens in tokens.
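
In terms of offsets, wrapping each token in a single-token span means that token i becomes the token range [i, i + 1). The sketch below shows that mapping over plain lists; the function name is hypothetical and the real method returns a TokenSpanArray.

```python
# Hypothetical sketch: one single-token span per token.
def single_token_spans(token_begins, token_ends):
    if len(token_begins) != len(token_ends):
        raise ValueError("token_begins and token_ends must have the same length")
    # Token i becomes the half-open token range [i, i + 1).
    return [(i, i + 1) for i in range(len(token_begins))]


token_spans = single_token_spans([0, 6, 11], [5, 10, 17])
print(token_spans)  # [(0, 1), (1, 2), (2, 3)]
```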

increment_version()[source]

Override parent class’s version of this function to also clear out data cached in the subclass.

property is_single_document: bool

True if every span in this array is over the same target text or if there are zero spans in this array.

property is_single_tokenization: bool

True if every span in this array is over the same tokenization of the same target text or if there are zero spans in this array.

isna() numpy.array[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

classmethod make_array(o) text_extensions_for_pandas.array.token_span.TokenSpanArray[source]

Make a TokenSpanArray object out of any of several types of input.

Parameters

o – a TokenSpanArray object represented as a pd.Series, a list of TokenSpan objects, or an actual TokenSpanArray object.

Returns

TokenSpanArray version of o, which may be a pointer to o or one of its fields.

property nbytes: int

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property nulls_mask: numpy.ndarray

A boolean mask indicating which rows are nulls

same_target_text(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]
Parameters

other – Either a single span or an array of spans of the same length as this one

Returns

Numpy array containing a boolean mask of all entries that have the same target text. Two spans with target text of None are considered to have the same target text.

same_tokens(other: Union[text_extensions_for_pandas.array.token_span.TokenSpanArray, text_extensions_for_pandas.array.token_span.TokenSpan])[source]
Parameters

other – Either a single span or an array of spans of the same length as this one. Must be token-based.

Returns

Numpy array containing a boolean mask of all entries that are over the same tokenization of the same target text. Two spans with target text of None are considered to have the same target text.

split_by_document() List[text_extensions_for_pandas.array.span.SpanArray][source]
Returns

A list of slices of this SpanArray that cover single documents.

take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) text_extensions_for_pandas.array.token_span.TokenSpanArray[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property target_text: numpy.ndarray

“document” texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

property tokens: numpy.ndarray

The tokens over which each TokenSpan in this array is defined, as an ndarray of objects.

TokenSpan Class: Object to represent a single token-based span

class text_extensions_for_pandas.TokenSpan(tokens: Any, begin_token: int, end_token: int)[source]

Python object representation of a single span with token offsets; that is, a single row of a TokenSpanArray.

This class is also a subclass of Span and can return character-level information.

An offset of TokenSpan.NULL_OFFSET_VALUE (currently -1) indicates “not a span” in the sense that NaN is “not a number”.

Public Data Attributes:

USE_TOKEN_OFFSETS_IN_REPR

tokens

begin_token

end_token

Inherited from Span

NULL_OFFSET_VALUE

begin

end

target_text

covered_text

Returns the substring of self.target_text that this Span represents.

Public Methods:

__init__(tokens, begin_token, end_token)

param tokens

Tokenization information about the document, including

make_null(tokens)

Convenience method for building null spans.

__repr__()

Return repr(self).

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

__lt__(other)

span1 < span2 if span1.end <= span2.begin

Inherited from Span

__init__(tokens, begin_token, end_token)

param tokens

Tokenization information about the document, including

__repr__()

Return repr(self).

__eq__(other)

Return self==value.

__hash__()

Return hash(self).

__lt__(other)

span1 < span2 if span1.end <= span2.begin

__gt__(other)

Return self>value.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.

overlaps(other)

param other

Another Span or TokenSpan

contains(other)

param other

Another Span or TokenSpan

context([num_chars])

Show the location of this span in the context of the target string.

Inherited from TokenSpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.


classmethod make_null(tokens)[source]

Convenience method for building null spans.

Parameters

tokens – Tokens of the target string

Returns

A null span over the indicated tokens
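
As a rough illustration of the semantics described in this section, the sketch below models a token-based span in plain Python. The MiniTokenSpan class, its token_offsets field, the is_null helper, and the example text are hypothetical stand-ins and not the library's implementation. The sketch also assumes that end_token is exclusive. Only the null offset value of -1 and the ordering rule "span1 < span2 if span1.end <= span2.begin" are taken from the documentation above.

```python
from dataclasses import dataclass
from typing import Tuple

NULL_OFFSET_VALUE = -1  # documented value of TokenSpan.NULL_OFFSET_VALUE


@dataclass(frozen=True)
class MiniTokenSpan:
    """Hypothetical, simplified model of a token-based span."""

    text: str
    token_offsets: Tuple[Tuple[int, int], ...]  # (begin, end) characters per token
    begin_token: int
    end_token: int  # assumed exclusive in this sketch

    @classmethod
    def make_null(cls, text, token_offsets):
        # A null span uses the sentinel for both token offsets.
        return cls(text, token_offsets, NULL_OFFSET_VALUE, NULL_OFFSET_VALUE)

    @property
    def is_null(self) -> bool:
        return self.begin_token == NULL_OFFSET_VALUE

    @property
    def begin(self) -> int:
        # Character offset where the first covered token starts.
        if self.is_null:
            return NULL_OFFSET_VALUE
        return self.token_offsets[self.begin_token][0]

    @property
    def end(self) -> int:
        # Character offset where the last covered token ends.
        if self.is_null:
            return NULL_OFFSET_VALUE
        return self.token_offsets[self.end_token - 1][1]

    @property
    def covered_text(self) -> str:
        return self.text[self.begin:self.end]

    def __lt__(self, other) -> bool:
        # Documented rule: span1 < span2 if span1.end <= span2.begin
        return self.end <= other.begin


text = "Pandas stores spans"
offsets = ((0, 6), (7, 13), (14, 19))  # one (begin, end) pair per token

first_two = MiniTokenSpan(text, offsets, 0, 2)
last = MiniTokenSpan(text, offsets, 2, 3)
null_span = MiniTokenSpan.make_null(text, offsets)

print(first_two.covered_text)  # Pandas stores
print(first_two < last)        # True
print(null_span.is_null)       # True
```

Because the ordering only holds when one span ends at or before the start of the other, two overlapping spans are neither less than nor greater than each other under this rule.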

Tensor Extension Type

The TensorDtype extension data type efficiently stores tensors in the rows of a Pandas Series. For efficiency, we store all of the tensors in a Series in a single NumPy array.

class text_extensions_for_pandas.TensorDtype[source]

Pandas data type for a column of tensors with the same shape.

TensorArray Class: Store tensors in a Pandas Series

class text_extensions_for_pandas.TensorArray(values: Union[numpy.ndarray, Sequence[Union[numpy.ndarray, text_extensions_for_pandas.array.tensor.TensorElement]], text_extensions_for_pandas.array.tensor.TensorElement, Any])[source]

A Pandas ExtensionArray that represents a column of numpy.ndarray objects, or tensors, where the outer dimension is the count of tensors in the column. Each tensor must have the same shape.
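
To make this storage layout concrete, the following sketch uses plain NumPy only and does not call the library. The array name backing and its values are made up for illustration. It shows how a single ndarray whose outer dimension indexes rows can hold one same-shaped tensor per row.

```python
import numpy as np

# Four rows, each holding a 2x3 tensor, stored as one (4, 2, 3) ndarray.
backing = np.arange(24, dtype=np.float64).reshape(4, 2, 3)

num_rows = backing.shape[0]        # count of tensors in the column
element_shape = backing.shape[1:]  # shape shared by every tensor

# Indexing the outer dimension yields the tensor for one row.
row_1 = backing[1]

print(num_rows, element_shape)
print(row_1)
```

According to the constructor signature above, an ndarray like this is one of the accepted forms of the values argument of TensorArray.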

Public Data Attributes:

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

inferred_type

Return string describing type of TensorArray.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

numpy_dtype

Get the dtype of the tensor.

numpy_ndim

Get the number of tensor dimensions.

numpy_shape

Get the shape of the tensor.

Inherited from ExtensionArray

dtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

shape

Return a tuple of the array dimensions.

size

The number of elements in the array.

ndim

Extension Arrays are only allowed to be 1-dimensional.

nbytes

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

T

Public Methods:

__init__(values)

param values

A numpy.ndarray or sequence of

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

to_numpy([dtype, copy, na_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

any([axis, out, keepdims])

Test whether any array element along a given axis evaluates to True.

all([axis, out, keepdims])

Test whether all array elements along a given axis evaluate to True.

__len__()

Length of this array

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__contains__(item)

Return for item in self.

__repr__()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__str__()

Return str(self).

__array__([dtype])

Interface to return the backing tensor as a numpy array with optional dtype.

__array_ufunc__(ufunc, method, *inputs, **kwargs)

Interface to handle numpy ufuncs that will accept TensorArray as input, and wrap the output back as another TensorArray.

__arrow_array__([type])

__add__(other)

__radd__(other)

__sub__(other)

__rsub__(other)

__mul__(other)

__rmul__(other)

__pow__(other)

__rpow__(other)

__mod__(other)

__rmod__(other)

__floordiv__(other)

__rfloordiv__(other)

__truediv__(other)

__rtruediv__(other)

__divmod__(other)

__rdivmod__(other)

__eq__(other)

Return for self == other (element-wise equality).

__ne__(other)

Return for self != other (element-wise inequality).

__lt__(other)

Return self<value.

__gt__(other)

Return self>value.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.

Inherited from ExtensionArray

__getitem__(item)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__setitem__(key, value)

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

__len__()

Length of this array

__iter__()

Iterate over elements of the array.

__contains__(item)

Return for item in self.

__eq__(other)

Return for self == other (element-wise equality).

__ne__(other)

Return for self != other (element-wise inequality).

to_numpy([dtype, copy, na_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

astype(dtype[, copy])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

isna()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

argsort([ascending, kind, na_position])

Return the indices that would sort this array.

argmin([skipna])

Return the index of minimum value.

argmax([skipna])

Return the index of maximum value.

fillna([value, method, limit])

Fill NA/NaN values using the specified method.

dropna()

Return ExtensionArray without NA values.

shift([periods, fill_value])

Shift values by desired number.

unique()

Compute the ExtensionArray of unique values.

searchsorted(value[, side, sorter])

Find indices where elements should be inserted to maintain order.

equals(other)

Return if another array is equivalent to this array.

isin(values)

Pointwise comparison for set containment in the given values.

factorize([na_sentinel])

Encode the extension array as an enumerated type.

repeat(repeats[, axis])

Repeat elements of an ExtensionArray.

take(indices[, allow_fill, fill_value])

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

view([dtype])

Return a view on the array.

__repr__()

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

transpose(*axes)

Return a transposed view on this array.

ravel([order])

Return a flattened view on this array.

tolist()

Return a list of the values.

delete(loc)

insert(loc, item)

Insert an item at the given position.

__array_ufunc__(ufunc, method, *inputs, **kwargs)

Interface to handle numpy ufuncs that will accept TensorArray as input, and wrap the output back as another TensorArray.
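
The unwrap, apply, and rewrap pattern behind __array_ufunc__ can be sketched with NumPy's standard override protocol. The WrappedArray class below is a hypothetical stand-in written for this illustration. It is not the library's TensorArray, and it ignores details such as the out keyword argument that a complete implementation would need to handle.

```python
import numpy as np


class WrappedArray:
    """Hypothetical wrapper around an ndarray; not the library's TensorArray."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any WrappedArray inputs to their backing ndarrays.
        raw = [x.values if isinstance(x, WrappedArray) else x for x in inputs]
        # Apply the ufunc (or one of its methods, such as "reduce").
        result = getattr(ufunc, method)(*raw, **kwargs)
        # Wrap ndarray results again; pass scalars and other results through.
        if isinstance(result, np.ndarray):
            return WrappedArray(result)
        return result


a = WrappedArray([[1.0, 4.0], [9.0, 16.0]])

rooted = np.sqrt(a)      # NumPy dispatches to WrappedArray.__array_ufunc__
shifted = np.add(a, 1.0)

print(type(rooted).__name__)
print(rooted.values)
print(shifted.values)
```

With this pattern, callers can apply NumPy ufuncs directly to the wrapper and still get the wrapper type back.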


all(axis=None, out=None, keepdims=False)[source]

Test whether all array elements along a given axis evaluate to True.

Parameters
  • axis – Axis or axes along which a logical AND reduction is performed.

  • out – Alternate output array in which to place the result.

  • keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.

Returns

A single boolean if axis is None; otherwise a TensorArray.

any(axis=None, out=None, keepdims=False)[source]

Test whether any array element along a given axis evaluates to True.

See numpy.any() documentation for more information https://numpy.org/doc/stable/reference/generated/numpy.any.html#numpy.any

Parameters
  • axis – Axis or axes along which a logical OR reduction is performed.

  • out – Alternate output array in which to place the result.

  • keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.

Returns

A single boolean if axis is None; otherwise a TensorArray.
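
The axis and keepdims parameters of any() and all() follow the NumPy reductions of the same names. The sketch below uses numpy.any() and numpy.all() directly on a plain ndarray rather than on a TensorArray, and the example values are made up for illustration.

```python
import numpy as np

# Three rows, each a length-2 boolean tensor (hypothetical example values).
values = np.array([[True, False],
                   [False, False],
                   [True, True]])

any_overall = np.any(values)           # axis=None reduces to a single boolean
any_per_row = np.any(values, axis=1)   # logical OR within each row
all_per_row = np.all(values, axis=1)   # logical AND within each row

# keepdims=True leaves the reduced axis in place with size one.
any_kept = np.any(values, axis=1, keepdims=True)

print(any_overall)
print(any_per_row, all_per_row)
print(any_kept.shape)
```

Keeping the reduced axis is convenient when the result needs to broadcast against the original array.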

astype(dtype, copy=True)[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy() text_extensions_for_pandas.array.tensor.TensorArray[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property dtype: pandas.core.dtypes.base.ExtensionDtype

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property inferred_type: str

Return string describing type of TensorArray. Delegates to pandas.api.types.infer_dtype(). See docstring for more information.

Returns

string describing numpy type of this TensorArray

isna() numpy.array[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property nbytes: int

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property numpy_dtype

Get the dtype of the tensor.

Returns

The numpy dtype of the backing ndarray

property numpy_ndim

Get the number of tensor dimensions.

Returns

integer for the number of dimensions

property numpy_shape

Get the shape of the tensor.

Returns

A tuple of integers for the numpy shape of the backing ndarray

take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) text_extensions_for_pandas.array.tensor.TensorArray[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

to_numpy(dtype=None, copy=False, na_value=NoDefault.no_default)[source]

See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

TensorElement Class: Object to represent a single tensor

class text_extensions_for_pandas.TensorElement(values: numpy.ndarray)[source]

Class representing a single element in a TensorArray, or a row in a Pandas column of dtype TensorDtype. This is a light wrapper over a numpy.ndarray.

Public Methods:

__init__(values)

Construct a TensorElement from a numpy.ndarray.

__repr__()

Return repr(self).

__str__()

Return str(self).

to_numpy()

Return the values of this element as a numpy.ndarray

__array__()

__add__(other)

__radd__(other)

__sub__(other)

__rsub__(other)

__mul__(other)

__rmul__(other)

__pow__(other)

__rpow__(other)

__mod__(other)

__rmod__(other)

__floordiv__(other)

__rfloordiv__(other)

__truediv__(other)

__rtruediv__(other)

__divmod__(other)

__rdivmod__(other)

__eq__(other)

Return self==value.

__ne__(other)

Return self!=value.

__lt__(other)

Return self<value.

__gt__(other)

Return self>value.

__le__(other)

Return self<=value.

__ge__(other)

Return self>=value.


to_numpy()[source]

Return the values of this element as a numpy.ndarray

Returns

numpy.ndarray