Pandas Extension Types

Text Extensions for Pandas includes extension types for representing spans and tensors inside Pandas DataFrames. This section describes the Python classes that implement these types.

Span Extension Type

The SpanDtype extension data type efficiently stores span data in a Pandas Series. Each span is represented by begin and end character offsets into a target document. We use dense NumPy arrays for efficient internal storage.

class text_extensions_for_pandas.SpanDtype[source]¶: Pandas datatype for a span that represents a range of characters within a target string.

SpanArray Class: Store spans in a Pandas Series

class text_extensions_for_pandas.SpanArray(text: Union[str, Sequence[str], numpy.ndarray, Tuple[text_extensions_for_pandas.array.string_table.StringTable, numpy.ndarray]], begins: Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]], ends: Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]])[source]¶

A Pandas ExtensionArray that represents a column of character-based spans over a single target text.

Spans are represented as [begin, end) intervals, where begin and end are character offsets into the target text.
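
The half-open `[begin, end)` convention is the same one Python slicing uses. The following is a minimal plain-Python sketch of that convention only; it does not call the library, and the variable names simply mirror the attribute names listed on this page.

```python
# Plain-Python illustration of [begin, end) character offsets.
# This is not the SpanArray implementation, only the interval convention.
target_text = "Hello world"
begins = [0, 6]
ends = [5, 11]

# The character at offset `end` is excluded, exactly as in Python slicing.
covered_text = [target_text[b:e] for b, e in zip(begins, ends)]
print(covered_text)  # ['Hello', 'world']
```

Because the end offset is exclusive, the length of each span is `end - begin`, and two adjacent spans can share a boundary offset without sharing a character.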

Public Data Attributes:

`dtype`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`nbytes`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`target_text`	"document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
`document_text`	if all spans in this array cover the same document, text of that document.
`is_single_document`	True if there is at least one span in the array and every span is over the same target text.
`begin`
`end`
`version`	Monotonically increasing version number that changes every time this array is modified.
`covered_text`	an array of the substrings of target_text corresponding to the spans in this array.
`normalized_covered_text`	A normalized version of the covered text of the spans in this array.
`repr_html_show_offsets`	Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.

Inherited from ExtensionArray

`dtype`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`shape`	Return a tuple of the array dimensions.
`size`	The number of elements in the array.
`ndim`	Extension Arrays are only allowed to be 1-dimensional.
`nbytes`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`T`

Public Methods:

`__init__`(text, begins, ends)	Factory method for creating instances of this class.
`astype`(dtype[, copy])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__len__`()	Length of this array
`__getitem__`(item)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__setitem__`(key, value)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__eq__`(other)	Pandas/Numpy-style array/series comparison function.
`__ne__`(other)	Return self != other (element-wise inequality).
`__hash__`()	Return hash(self).
`__contains__`(item)	Return true if scalar item exists in this SpanArray.
`equals`(other)	Return True if other is a `SpanArray` with the same target texts and the same spans in the same order.
`isna`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`copy`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`take`(indices[, allow_fill, fill_value])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__lt__`(other)	Pandas-style array/series comparison function.
`__gt__`(other)	Return self>value.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.
`make_array`(o)	Make a `SpanArray` object out of any of several types of input.
`split_by_document`()	Return a list of slices of this SpanArray that cover single documents.
`increment_version`()	Manually increase the version counter of this array to indicate that the array's contents have changed.
`as_tuples`()	returns (begin, end) pairs as an array of tuples
`as_frame`()	Returns a dataframe representation of this column based on Python atomic types.
`same_target_text`(other)	Return a boolean mask of the entries that have the same target text as the corresponding element of other.
`overlaps`(other)	Return a boolean mask of the entries that overlap the corresponding element of other.
`contains`(other)	Return a boolean mask of the entries that contain the corresponding element of other.
`__arrow_array__`([type])	Conversion of this Array to a pyarrow.ExtensionArray.

Inherited from ExtensionArray

`__getitem__`(item)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__setitem__`(key, value)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__len__`()	Length of this array
`__iter__`()	Iterate over elements of the array.
`__contains__`(item)	Return true if scalar item exists in this SpanArray.
`__eq__`(other)	Pandas/Numpy-style array/series comparison function.
`__ne__`(other)	Return self != other (element-wise inequality).
`to_numpy`([dtype, copy, na_value])	Convert to a NumPy ndarray.
`astype`(dtype[, copy])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`isna`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`argsort`([ascending, kind, na_position])	Return the indices that would sort this array.
`argmin`([skipna])	Return the index of minimum value.
`argmax`([skipna])	Return the index of maximum value.
`fillna`([value, method, limit])	Fill NA/NaN values using the specified method.
`dropna`()	Return ExtensionArray without NA values.
`shift`([periods, fill_value])	Shift values by desired number.
`unique`()	Compute the ExtensionArray of unique values.
`searchsorted`(value[, side, sorter])	Find indices where elements should be inserted to maintain order.
`equals`(other)	Return True if other is a `SpanArray` with the same target texts and the same spans in the same order.
`isin`(values)	Pointwise comparison for set containment in the given values.
`factorize`([na_sentinel])	Encode the extension array as an enumerated type.
`repeat`(repeats[, axis])	Repeat elements of an ExtensionArray.
`take`(indices[, allow_fill, fill_value])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`copy`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`view`([dtype])	Return a view on the array.
`__repr__`()	Return repr(self).
`transpose`(*axes)	Return a transposed view on this array.
`ravel`([order])	Return a flattened view on this array.
`tolist`()	Return a list of the values.
`delete`(loc)
`insert`(loc, item)	Insert an item at the given position.
`__array_ufunc__`(ufunc, method, *inputs, **kwargs)

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

as_frame() → pandas.core.frame.DataFrame[source]¶: Returns a dataframe representation of this column based on Python atomic types.

as_tuples() → numpy.ndarray[source]¶

Returns: (begin, end) pairs as an array of tuples

astype(dtype, copy=True)[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

contains(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶

Parameters: other – Either a single span or an array of spans of the same length as this one
Returns: Numpy array containing a boolean mask of all entries that contain the corresponding element of other
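
As a rough illustration of what an element-wise containment mask means, here is a plain-Python sketch. It is not the library's implementation: `contains_mask` and the `Offsets` alias are hypothetical names, spans are modelled as bare `(begin, end)` pairs, and all spans are assumed to share one target text.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # hypothetical (begin, end) pair, end exclusive


def contains_mask(spans: List[Offsets], others: List[Offsets]) -> List[bool]:
    """Element-wise containment for two equal-length lists of spans."""
    if len(spans) != len(others):
        raise ValueError("both inputs must have the same length")
    # Span i contains other i when other i starts no earlier and ends no later.
    return [b1 <= b2 and e2 <= e1 for (b1, e1), (b2, e2) in zip(spans, others)]


mask = contains_mask([(0, 10), (5, 8)], [(2, 4), (7, 12)])
print(mask)  # [True, False]
```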

copy() → text_extensions_for_pandas.array.span.SpanArray[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property covered_text: numpy.ndarray¶

Returns: An array of the substrings of target_text corresponding to the spans in this array.

property document_text: Optional[str]¶

Returns: If all spans in this array cover the same document, the text of that document. Raises a ValueError if the array is empty or if the spans in this array cover more than one document.

property dtype: pandas.core.dtypes.base.ExtensionDtype¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

equals(other: text_extensions_for_pandas.array.span.SpanArray)[source]¶

Parameters: other – A second SpanArray
Returns: True if both arrays have the same target texts (can be a different string object with the same contents) and the same spans in the same order.

increment_version()[source]¶: Manually increase the version counter of this array to indicate that the array’s contents have changed. Also invalidates any internal cached data derived from the array’s state.

property is_single_document: bool¶

Returns: True if there is at least one span in the array and every span is over the same target text.

isna() → numpy.array[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

classmethod make_array(o) → text_extensions_for_pandas.array.span.SpanArray[source]¶

Make a SpanArray object out of any of several types of input.

Parameters: o – a SpanArray object represented as a pd.Series, a list of Span objects, or an actual SpanArray (or TokenSpanArray) object.
Returns: SpanArray version of o, which may be a pointer to o or one of its fields.

property nbytes: int¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property normalized_covered_text: numpy.ndarray¶

Returns: A normalized version of the covered text of the spans in this array. Currently “normalized” means “lowercase”.
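
Since normalization currently means lowercasing, one plausible use is case-insensitive dictionary matching. The sketch below uses plain Python lists rather than the library's arrays, and the `dictionary` set is a made-up example.

```python
# Plain-Python stand-in for covered_text / normalized_covered_text.
covered_text = ["IBM", "New York", "ibm"]
normalized_covered_text = [s.lower() for s in covered_text]
print(normalized_covered_text)  # ['ibm', 'new york', 'ibm']

# Hypothetical lowercase dictionary of terms to look up.
dictionary = {"ibm"}
matches = [s in dictionary for s in normalized_covered_text]
print(matches)  # [True, False, True]
```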

overlaps(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶

Parameters: other – Either a single span or an array of spans of the same length as this one
Returns: Numpy array containing a boolean mask of all entries that overlap the corresponding element of other
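
The overlap mask can be pictured with the usual test for half-open intervals. This is a plain-Python sketch under stated assumptions, not the library's code: `overlaps_mask` is a hypothetical name, spans are bare `(begin, end)` pairs over one shared target text, and the library's handling of zero-length spans at a boundary may differ from this test.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # hypothetical (begin, end) pair, end exclusive


def overlaps_mask(spans: List[Offsets], others: List[Offsets]) -> List[bool]:
    """Element-wise overlap test for two equal-length lists of spans."""
    if len(spans) != len(others):
        raise ValueError("both inputs must have the same length")
    # Two half-open intervals overlap when each begins before the other ends.
    return [b1 < e2 and b2 < e1 for (b1, e1), (b2, e2) in zip(spans, others)]


mask = overlaps_mask([(0, 5), (5, 8), (3, 3)], [(4, 6), (8, 9), (0, 10)])
print(mask)  # [True, False, True]
```

The second pair shows that spans which merely touch at offset 8 do not overlap, and the third shows a zero-length span lying strictly inside another span counting as an overlap.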

property repr_html_show_offsets¶

Returns: Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.

same_target_text(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶

Parameters: other – Either a single span or an array of spans of the same length as this one
Returns: Numpy array containing a boolean mask of all entries that have the same target text. Two spans with target text of None are considered to have the same target text.

split_by_document() → List[text_extensions_for_pandas.array.span.SpanArray][source]¶

Returns: A list of slices of this SpanArray that cover single documents.

take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) → text_extensions_for_pandas.array.span.SpanArray[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property target_text: numpy.ndarray¶

Returns: The “document” texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

property version: int¶

Returns: A monotonically increasing version number that changes every time this array is modified. NOTE: This number might not change if a caller obtains a pointer to an internal array and modifies it. Callers who perform such modifications should call increment_version().
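
The interaction between the version counter, cached data, and direct writes to internal storage can be sketched with a small stand-in class. `VersionedOffsets` and its `lengths` cache are hypothetical and exist only to illustrate the pattern; the library's internals are not shown on this page.

```python
class VersionedOffsets:
    """Hypothetical container with a version counter (not the library class)."""

    def __init__(self, begins, ends):
        self.begins = list(begins)
        self.ends = list(ends)
        self.version = 0
        self._cached_lengths = None

    def increment_version(self):
        # Bump the counter and drop anything derived from the old state.
        self.version += 1
        self._cached_lengths = None

    @property
    def lengths(self):
        # Computed lazily and cached until the version changes.
        if self._cached_lengths is None:
            self._cached_lengths = [e - b for b, e in zip(self.begins, self.ends)]
        return self._cached_lengths


arr = VersionedOffsets([0, 6], [5, 11])
print(arr.lengths)               # [5, 5]
arr.ends[1] = 9                  # direct write to internal storage
print(arr.version, arr.lengths)  # 0 [5, 5]  (version unchanged, cache is stale)
arr.increment_version()
print(arr.version, arr.lengths)  # 1 [5, 3]
```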

Span Class: Object to represent a single span

class text_extensions_for_pandas.Span(text: str, begin: int, end: int)[source]¶

Python object representation of a single span with character offsets; that is, a single row of a SpanArray.

An offset of Span.NULL_OFFSET_VALUE (currently -1) indicates “not a span” in the sense that NaN is “not a number”.
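
A sentinel offset works much like NaN in a numeric column: it marks a row as missing without changing the column's type. The sketch below is plain Python, not library code; it assumes the value -1 given above and uses illustrative lists for the offsets.

```python
NULL_OFFSET_VALUE = -1  # the value the text above gives for Span.NULL_OFFSET_VALUE

text = "Hello world"
begins = [0, NULL_OFFSET_VALUE, 6]
ends = [5, NULL_OFFSET_VALUE, 11]

# A row is "not a span" when its offsets carry the sentinel value.
nulls_mask = [b == NULL_OFFSET_VALUE for b in begins]
print(nulls_mask)  # [False, True, False]

# Null rows are skipped rather than sliced.
covered = [None if is_null else text[b:e]
           for b, e, is_null in zip(begins, ends, nulls_mask)]
print(covered)  # ['Hello', None, 'world']
```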

Most of the methods and properties of this class are single-span versions of the eponymous methods in SpanArray. See that class for API documentation.

Public Data Attributes:

`NULL_OFFSET_VALUE`
`begin`
`end`
`target_text`
`covered_text`	Returns the substring of self.target_text that this Span represents.

Public Methods:

`__init__`(text, begin, end)	Create a span over text, the target document text on which the span is defined.
`__repr__`()	Return repr(self).
`__eq__`(other)	Return self==value.
`__hash__`()	Return hash(self).
`__lt__`(other)	span1 < span2 if span1.end <= span2.begin and both spans are over the same target text
`__gt__`(other)	Return self>value.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.
`overlaps`(other)	Return True if this span overlaps other, another Span or TokenSpan.
`contains`(other)	Return True if other, another Span or TokenSpan, is entirely within the bounds of this span.
`context`([num_chars])	Show the location of this span in the context of the target string.

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

contains(other: text_extensions_for_pandas.array.span.Span)[source]¶

Parameters: other – Another Span or TokenSpan
Returns: True if other is entirely within the bounds of this span. Also True if a zero-length span is contained within the other.

context(num_chars: int = 40) → str[source]¶

Show the location of this span in the context of the target string.

Parameters: num_chars – How many characters on either side to display
Returns: A string in the form: `<text before>[<text inside>]<text after>` describing the text within and around the span.
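
The output format described above can be approximated in a few lines of plain Python. `show_context` is a hypothetical helper, not the library method, and the real method may add extra markers (for example, to indicate truncated text) that this sketch omits.

```python
def show_context(text: str, begin: int, end: int, num_chars: int = 40) -> str:
    """Return '<text before>[<text inside>]<text after>' for a [begin, end) span."""
    before = text[max(0, begin - num_chars):begin]
    inside = text[begin:end]
    after = text[end:end + num_chars]
    return f"{before}[{inside}]{after}"


print(show_context("The quick brown fox", 4, 9, num_chars=4))  # The [quick] bro
```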

property covered_text¶: Returns the substring of self.target_text that this Span represents.

overlaps(other: text_extensions_for_pandas.array.span.Span)[source]¶

Parameters: other – Another Span or TokenSpan
Returns: True if the two spans overlap. Also True if a zero-length span is contained within the other.

Token-Based Span Extension Type

The TokenSpanDtype extension data type is similar to SpanDtype, except that it represents spans using begin and end offsets into the tokens of a target document. These tokens are stored in a (shared) SpanArray object.

class text_extensions_for_pandas.TokenSpanDtype[source]¶: Pandas datatype for a span that represents a range of tokens within a target string.

TokenSpanArray Class: Store token-based spans in a Pandas Series

class text_extensions_for_pandas.TokenSpanArray(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, Sequence[text_extensions_for_pandas.array.span.SpanArray]], begin_tokens: Optional[Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]]] = None, end_tokens: Optional[Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]]] = None)[source]¶

A Pandas ExtensionArray that represents a column of token-based spans over a single target text.

Spans are represented internally as [begin_token, end_token) intervals, where the properties begin_token and end_token are token offsets into the target text. As with the parent class SpanArray, the properties begin and end of a TokenSpanArray return character offsets.

Null values are encoded with begin and end offsets of TokenSpan.NULL_OFFSET_VALUE.

Fields:

self._tokens: Reference to the target string’s tokens as a SpanArray. For now, references to different SpanArray objects are treated as different even if the arrays have the same contents.
self._begin_tokens: Numpy array of integer offsets in tokens. An offset of TokenSpan.NULL_OFFSET_VALUE here indicates a null value.
self._end_tokens: Numpy array of end offsets (1 + last token in span).
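
The relationship between token offsets and character offsets described above can be sketched in plain Python. This is an illustration of the convention only, not the library's code; the token offsets below are hand-written for the example sentence.

```python
text = "Pandas stores spans"

# Character offsets of each token, end exclusive (hand-written for this example).
token_begins = [0, 7, 14]
token_ends = [6, 13, 19]

# A token-based span [begin_token, end_token) covering tokens 1 and 2.
begin_token, end_token = 1, 3

# Character begin comes from the first token in the span; character end comes
# from the last token in the span, which is end_token - 1.
begin = token_begins[begin_token]
end = token_ends[end_token - 1]
print(begin, end)       # 7 19
print(text[begin:end])  # stores spans
```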

Public Data Attributes:

`dtype`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`nbytes`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`tokens`	The tokens over which each TokenSpan in this array is defined, as an ndarray of objects.
`target_text`	"document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
`document_text`	if all spans in this array cover the same document, text of that document.
`document_tokens`	if all spans in this array cover the same tokenization of a single document, tokens of that document.
`nulls_mask`	A boolean mask indicating which rows are nulls
`begin`	the character offsets of the span begins.
`end`	the character offsets of the span ends.
`begin_token`	Token offsets of the span begins; that is, the index of the first token in each span.
`end_token`	Token offsets of the span ends.
`covered_text`	Returns an array of the substrings of target_text corresponding to the spans in this array.
`is_single_document`	True if every span in this array is over the same target text or if there are zero spans in this array.
`is_single_tokenization`	True if every span in this array is over the same tokenization of the same target text or if there are zero spans in this array.

Inherited from SpanArray

`dtype`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`nbytes`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`target_text`	"document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
`document_text`	if all spans in this array cover the same document, text of that document.
`is_single_document`	True if every span in this array is over the same target text or if there are zero spans in this array.
`begin`	the character offsets of the span begins.
`end`	the character offsets of the span ends.
`version`	Monotonically increasing version number that changes every time this array is modified.
`covered_text`	Returns an array of the substrings of target_text corresponding to the spans in this array.
`normalized_covered_text`	A normalized version of the covered text of the spans in this array.
`repr_html_show_offsets`	Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.

Inherited from ExtensionArray

`dtype`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`shape`	Return a tuple of the array dimensions.
`size`	The number of elements in the array.
`ndim`	Extension Arrays are only allowed to be 1-dimensional.
`nbytes`	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`T`

Public Methods:

`__init__`(tokens[, begin_tokens, end_tokens])	param tokens Character-level span information about the underlying
`from_char_offsets`(tokens)	Convenience factory method for wrapping the character-level spans of a series of tokens into single-token token-based spans.
`astype`(dtype[, copy])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__len__`()	Length of this array
`__getitem__`(item)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__setitem__`(key, value)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__eq__`(other)	Pandas/Numpy-style array/series comparison function.
`__hash__`()	Return hash(self).
`__contains__`(item)	Return true if scalar item exists in this TokenSpanArray.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.
`isna`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`copy`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`take`(indices[, allow_fill, fill_value])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`make_array`(o)	Make a `TokenSpanArray` object out of any of several types of input.
`align_to_tokens`(tokens, spans)	Align a set of character or token-based spans to a specified tokenization, producing a TokenSpanArray of token-based spans.
`as_tuples`()	Returns (begin, end) pairs as an array of tuples
`increment_version`()	Override parent class's version of this function to also clear out data cached in the subclass.
`as_frame`()	Returns a dataframe representation of this column based on Python atomic types.
`same_target_text`(other)	Return a boolean mask of the entries that have the same target text as the corresponding element of other.
`same_tokens`(other)	Return a boolean mask of the entries that are over the same tokenization of the same target text as the corresponding element of other.
`split_by_document`()	Return a list of slices of this SpanArray that cover single documents.
`__arrow_array__`([type])	Conversion of this Array to a pyarrow.ExtensionArray.

Inherited from SpanArray

`__init__`(tokens[, begin_tokens, end_tokens])	param tokens Character-level span information about the underlying
`astype`(dtype[, copy])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__len__`()	Length of this array
`__getitem__`(item)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__setitem__`(key, value)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__eq__`(other)	Pandas/Numpy-style array/series comparison function.
`__ne__`(other)	Return self != other (element-wise inequality).
`__hash__`()	Return hash(self).
`__contains__`(item)	Return true if scalar item exists in this TokenSpanArray.
`equals`(other)	Return True if other is a `SpanArray` with the same target texts and the same spans in the same order.
`isna`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`copy`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`take`(indices[, allow_fill, fill_value])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__lt__`(other)	Pandas-style array/series comparison function.
`__gt__`(other)	Return self>value.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.
`make_array`(o)	Make a `TokenSpanArray` object out of any of several types of input.
`split_by_document`()	Return a list of slices of this SpanArray that cover single documents.
`increment_version`()	Override parent class's version of this function to also clear out data cached in the subclass.
`as_tuples`()	Returns (begin, end) pairs as an array of tuples
`as_frame`()	Returns a dataframe representation of this column based on Python atomic types.
`same_target_text`(other)	Return a boolean mask of the entries that have the same target text as the corresponding element of other.
`overlaps`(other)	Return a boolean mask of the entries that overlap the corresponding element of other.
`contains`(other)	Return a boolean mask of the entries that contain the corresponding element of other.
`__arrow_array__`([type])	Conversion of this Array to a pyarrow.ExtensionArray.

Inherited from ExtensionArray

`__getitem__`(item)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__setitem__`(key, value)	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`__len__`()	Length of this array
`__iter__`()	Iterate over elements of the array.
`__contains__`(item)	Return true if scalar item exists in this TokenSpanArray.
`__eq__`(other)	Pandas/Numpy-style array/series comparison function.
`__ne__`(other)	Return self != other (element-wise inequality).
`to_numpy`([dtype, copy, na_value])	Convert to a NumPy ndarray.
`astype`(dtype[, copy])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`isna`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`argsort`([ascending, kind, na_position])	Return the indices that would sort this array.
`argmin`([skipna])	Return the index of minimum value.
`argmax`([skipna])	Return the index of maximum value.
`fillna`([value, method, limit])	Fill NA/NaN values using the specified method.
`dropna`()	Return ExtensionArray without NA values.
`shift`([periods, fill_value])	Shift values by desired number.
`unique`()	Compute the ExtensionArray of unique values.
`searchsorted`(value[, side, sorter])	Find indices where elements should be inserted to maintain order.
`equals`(other)	Return True if other is a `SpanArray` with the same target texts and the same spans in the same order.
`isin`(values)	Pointwise comparison for set containment in the given values.
`factorize`([na_sentinel])	Encode the extension array as an enumerated type.
`repeat`(repeats[, axis])	Repeat elements of an ExtensionArray.
`take`(indices[, allow_fill, fill_value])	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`copy`()	See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
`view`([dtype])	Return a view on the array.
`__repr__`()	Return repr(self).
`transpose`(*axes)	Return a transposed view on this array.
`ravel`([order])	Return a flattened view on this array.
`tolist`()	Return a list of the values.
`delete`(loc)
`insert`(loc, item)	Insert an item at the given position.
`__array_ufunc__`(ufunc, method, *inputs, **kwargs)

Inherited from TokenSpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

classmethod align_to_tokens(tokens: Any, spans: Any)[source]¶

Align a set of character or token-based spans to a specified tokenization, producing a TokenSpanArray of token-based spans.

Parameters

tokens – The tokens to align to, as any type that SpanArray.make_array() accepts.
spans – The spans to align. These spans must all target the same text as tokens.

Returns

An array of TokenSpan objects aligned to the tokens of tokens. Raises ValueError if any of the spans in spans doesn’t start and end on a token boundary.
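
The alignment described above, including the error for spans that do not fall on token boundaries, can be sketched in plain Python. `align_char_spans` is a hypothetical helper that works on bare `(begin, end)` character pairs and returns `(begin_token, end_token)` pairs; it is not the library's implementation.

```python
from typing import List, Tuple

Offsets = Tuple[int, int]  # hypothetical (begin, end) pair, end exclusive


def align_char_spans(tokens: List[Offsets], spans: List[Offsets]) -> List[Offsets]:
    """Map character spans to (begin_token, end_token) pairs, end exclusive."""
    begin_to_token = {begin: i for i, (begin, _) in enumerate(tokens)}
    end_to_token = {end: i + 1 for i, (_, end) in enumerate(tokens)}
    aligned = []
    for begin, end in spans:
        if begin not in begin_to_token or end not in end_to_token:
            raise ValueError(f"span [{begin}, {end}) does not fall on token boundaries")
        aligned.append((begin_to_token[begin], end_to_token[end]))
    return aligned


tokens = [(0, 6), (7, 13), (14, 19)]  # tokens of "Pandas stores spans"
print(align_char_spans(tokens, [(7, 19), (0, 6)]))  # [(1, 3), (0, 1)]
```

A span such as `(8, 13)`, which begins in the middle of the second token, would raise `ValueError` in this sketch.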

as_frame() → pandas.core.frame.DataFrame[source]¶: Returns a dataframe representation of this column based on Python atomic types.

as_tuples() → numpy.ndarray[source]¶: Returns (begin, end) pairs as an array of tuples

astype(dtype, copy=True)[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property begin: numpy.ndarray¶

Returns: The character offsets at which the spans begin.

property begin_token: numpy.ndarray¶

Returns: The token offsets at which the spans begin; that is, the index of the first token in each span.

copy() → text_extensions_for_pandas.array.token_span.TokenSpanArray[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property covered_text: numpy.ndarray¶: Returns an array of the substrings of target_text corresponding to the spans in this array.

property document_text: Optional[str]¶

Returns: If all spans in this array cover the same document, the text of that document. Raises a ValueError if the array is empty or if the spans in this array cover more than one document.

property document_tokens: Optional[text_extensions_for_pandas.array.span.SpanArray]¶

Returns: If all spans in this array cover the same tokenization of a single document, the tokens of that document. Raises a ValueError if the array is empty or if the spans in this array cover more than one document.

property dtype: pandas.core.dtypes.base.ExtensionDtype¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property end: numpy.ndarray¶

Returns: The character offsets at which the spans end.

property end_token: numpy.ndarray¶

Returns: The token offsets at which the spans end; that is, 1 + the index of the last token in the span, for each span in the column.

static from_char_offsets(tokens: Any) → text_extensions_for_pandas.array.token_span.TokenSpanArray[source]¶

Convenience factory method for wrapping the character-level spans of a series of tokens into single-token token-based spans.

Parameters: tokens – character-based offsets of the tokens, as any type that SpanArray.make_array() understands.
Returns: A TokenSpanArray containing single-token spans for each of the tokens in tokens.
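
In terms of offsets, wrapping each token in a single-token span means that token `i` becomes the token range `[i, i + 1)`. The plain-Python sketch below shows only that mapping; it does not call the library, and the token offsets are hand-written for illustration.

```python
# Character offsets of three tokens, end exclusive (hand-written example).
token_char_offsets = [(0, 6), (7, 13), (14, 19)]

# One single-token span per token: token i becomes [i, i + 1).
single_token_spans = [(i, i + 1) for i in range(len(token_char_offsets))]
print(single_token_spans)  # [(0, 1), (1, 2), (2, 3)]

# Converting back to character offsets recovers the original token offsets.
char_offsets = [(token_char_offsets[b][0], token_char_offsets[e - 1][1])
                for b, e in single_token_spans]
print(char_offsets == token_char_offsets)  # True
```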

increment_version()[source]¶: Override parent class’s version of this function to also clear out data cached in the subclass.

property is_single_document: bool¶

Returns: True if every span in this array is over the same target text or if there are zero spans in this array.

property is_single_tokenization: bool¶

Returns: True if every span in this array is over the same tokenization of the same target text or if there are zero spans in this array.

isna() → numpy.array[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

classmethod make_array(o) → text_extensions_for_pandas.array.token_span.TokenSpanArray[source]¶

Make a TokenSpanArray object out of any of several types of input.

Parameters: o – a TokenSpanArray object represented as a pd.Series, a list of TokenSpan objects, or an actual TokenSpanArray object.
Returns: TokenSpanArray version of o, which may be a pointer to o or one of its fields.

property nbytes: int¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property nulls_mask: numpy.ndarray¶

Returns: A boolean mask indicating which rows are nulls.

same_target_text(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶

Parameters: other – Either a single span or an array of spans of the same length as this one
Returns: Numpy array containing a boolean mask of all entries that have the same target text. Two spans with target text of None are considered to have the same target text.

same_tokens(other: Union[text_extensions_for_pandas.array.token_span.TokenSpanArray, text_extensions_for_pandas.array.token_span.TokenSpan])[source]¶

Parameters: other – Either a single span or an array of spans of the same length as this one. Must be token-based.
Returns: Numpy array containing a boolean mask of all entries that are over the same tokenization of the same target text. Two spans with target text of None are considered to have the same target text.

split_by_document() → List[text_extensions_for_pandas.array.span.SpanArray][source]¶

Returns: A list of slices of this SpanArray that cover single documents.

take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) → text_extensions_for_pandas.array.token_span.TokenSpanArray[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property target_text: numpy.ndarray¶

Returns: The “document” texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.

property tokens: numpy.ndarray¶

Returns: The tokens over which each TokenSpan in this array is defined, as an ndarray of objects.

TokenSpan Class: Object to represent a single token-based span

class text_extensions_for_pandas.TokenSpan(tokens: Any, begin_token: int, end_token: int)[source]¶

Python object representation of a single span with token offsets; that is, a single row of a TokenSpanArray.

This class is also a subclass of Span and can return character-level information.

An offset of TokenSpan.NULL_OFFSET_VALUE (currently -1) indicates “not a span” in the sense that NaN is “not a number”.

Public Data Attributes:

`USE_TOKEN_OFFSETS_IN_REPR`
`tokens`
`begin_token`
`end_token`

Inherited from Span

`NULL_OFFSET_VALUE`
`begin`
`end`
`target_text`
`covered_text`	Returns the substring of self.target_text that this Span represents.

Public Methods:

`__init__`(tokens, begin_token, end_token)	param tokens Tokenization information about the document, including
`make_null`(tokens)	Convenience method for building null spans.
`__repr__`()	Return repr(self).
`__eq__`(other)	Return self==value.
`__hash__`()	Return hash(self).
`__lt__`(other)	span1 < span2 if span1.end <= span2.begin

Inherited from Span

`__init__`(tokens, begin_token, end_token)	param tokens Tokenization information about the document, including
`__repr__`()	Return repr(self).
`__eq__`(other)	Return self==value.
`__hash__`()	Return hash(self).
`__lt__`(other)	span1 < span2 if span1.end <= span2.begin
`__gt__`(other)	Return self>value.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.
`overlaps`(other)	param other Another Span or TokenSpan
`contains`(other)	param other Another Span or TokenSpan
`context`([num_chars])	Show the location of this span in the context of the target string.

Inherited from TokenSpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

Inherited from SpanOpMixin

__add__(other)

Add a pair of spans and/or span arrays.

classmethod make_null(tokens)[source]¶

Convenience method for building null spans.

Parameters: tokens – Tokens of the target string.
Returns: A null span over the indicated tokens.
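The null-offset convention and the ordering rule listed above (`span1 < span2 if span1.end <= span2.begin`) can be illustrated without the library. The following is a minimal sketch only: `SpanSketch` and `NULL_OFFSET` are hypothetical names introduced for illustration, not part of `text_extensions_for_pandas`, and the sketch makes no claim about how the real classes order null spans.

```python
from dataclasses import dataclass

# Mirrors the documented TokenSpan.NULL_OFFSET_VALUE (currently -1).
NULL_OFFSET = -1


@dataclass(frozen=True)
class SpanSketch:
    """Hypothetical half-open [begin, end) span; NOT the library's TokenSpan."""

    begin: int
    end: int

    @property
    def is_null(self) -> bool:
        # A begin offset equal to the sentinel marks "not a span".
        return self.begin == NULL_OFFSET

    def __lt__(self, other: "SpanSketch") -> bool:
        # Documented rule: span1 < span2 if span1.end <= span2.begin
        return self.end <= other.begin


first = SpanSketch(0, 2)
second = SpanSketch(2, 5)        # starts exactly where `first` ends
overlapping = SpanSketch(1, 3)   # overlaps `first`
null_span = SpanSketch(NULL_OFFSET, NULL_OFFSET)

print(first < second)            # adjacent, non-overlapping spans are ordered
print(first < overlapping)       # overlapping spans are not ordered either way
print(null_span.is_null)
```

Because the comparison is strict about overlap, two overlapping spans are neither less than nor greater than each other under this rule.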

Tensor Extension Type¶

The TensorDtype extension data type efficiently stores tensors in the rows of a Pandas Series. For efficiency, we store all of the tensors in a Series in a single NumPy array.

class text_extensions_for_pandas.TensorDtype[source]¶: Pandas data type for a column of tensors with the same shape.

TensorArray Class: Store tensors in a Pandas Series¶

class text_extensions_for_pandas.TensorArray(values: Union[numpy.ndarray, Sequence[Union[numpy.ndarray, text_extensions_for_pandas.array.tensor.TensorElement]], text_extensions_for_pandas.array.tensor.TensorElement, Any])[source]¶

A Pandas ExtensionArray that represents a column of numpy.ndarray objects, or tensors, where the outer dimension is the count of tensors in the column. Each tensor must have the same shape.
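The storage layout described above can be pictured with plain NumPy. This is a sketch of the idea only, assuming the column's tensors are stacked along a new outer axis; it does not call the library, and the variable names are illustrative.

```python
import numpy as np

# Three "rows", each a 2x2 tensor of the same shape.
tensors = [
    np.array([[1, 2], [3, 4]]),
    np.array([[5, 6], [7, 8]]),
    np.array([[9, 10], [11, 12]]),
]

# Stack along a new outer axis: one backing array for the whole column.
backing = np.stack(tensors)
print(backing.shape)   # outer dimension is the number of tensors
print(backing.ndim)

# Indexing the outer dimension recovers a single row's tensor.
row_1 = backing[1]

# Tensors with mismatched shapes cannot share one dense backing array.
try:
    np.stack([np.zeros((2, 2)), np.zeros((3, 2))])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

A single dense array keeps every row contiguous in memory, which is why each tensor in the column must have the same shape.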

Public Data Attributes:

`dtype`	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`inferred_type`	Return string describing type of TensorArray.
`nbytes`	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`numpy_dtype`	Get the dtype of the tensor.
`numpy_ndim`	Get the number of tensor dimensions.
`numpy_shape`	Get the shape of the tensor.

Inherited from ExtensionArray

`dtype`	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`shape`	Return a tuple of the array dimensions.
`size`	The number of elements in the array.
`ndim`	Extension Arrays are only allowed to be 1-dimensional.
`nbytes`	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`T`

Public Methods:

`__init__`(values)	param values A `numpy.ndarray` or sequence of
`isna`()	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`copy`()	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`take`(indices[, allow_fill, fill_value])	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`to_numpy`([dtype, copy, na_value])	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`astype`(dtype[, copy])	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`any`([axis, out, keepdims])	Test whether any array element along a given axis evaluates to `True`.
`all`([axis, out, keepdims])	Test whether all array elements along a given axis evaluate to `True`.
`__len__`()	Length of this array
`__getitem__`(item)	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`__setitem__`(key, value)	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`__contains__`(item)	Return the result of `item in self`.
`__repr__`()	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`__str__`()	Return str(self).
`__array__`([dtype])	Interface to return the backing tensor as a numpy array with optional dtype.
`__array_ufunc__`(ufunc, method, *inputs, **kwargs)	Interface to handle numpy ufuncs that will accept TensorArray as input, and wrap the output back as another TensorArray.
`__arrow_array__`([type])
`__add__`(other)
`__radd__`(other)
`__sub__`(other)
`__rsub__`(other)
`__mul__`(other)
`__rmul__`(other)
`__pow__`(other)
`__rpow__`(other)
`__mod__`(other)
`__rmod__`(other)
`__floordiv__`(other)
`__rfloordiv__`(other)
`__truediv__`(other)
`__rtruediv__`(other)
`__divmod__`(other)
`__rdivmod__`(other)
`__eq__`(other)	Return the result of `self == other` (element-wise equality).
`__ne__`(other)	Return the result of `self != other` (element-wise inequality).
`__lt__`(other)	Return self<value.
`__gt__`(other)	Return self>value.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.

Inherited from ExtensionArray

`__getitem__`(item)	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`__setitem__`(key, value)	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`__len__`()	Length of this array
`__iter__`()	Iterate over elements of the array.
`__contains__`(item)	Return the result of `item in self`.
`__eq__`(other)	Return the result of `self == other` (element-wise equality).
`__ne__`(other)	Return the result of `self != other` (element-wise inequality).
`to_numpy`([dtype, copy, na_value])	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`astype`(dtype[, copy])	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`isna`()	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`argsort`([ascending, kind, na_position])	Return the indices that would sort this array.
`argmin`([skipna])	Return the index of minimum value.
`argmax`([skipna])	Return the index of maximum value.
`fillna`([value, method, limit])	Fill NA/NaN values using the specified method.
`dropna`()	Return ExtensionArray without NA values.
`shift`([periods, fill_value])	Shift values by desired number.
`unique`()	Compute the ExtensionArray of unique values.
`searchsorted`(value[, side, sorter])	Find indices where elements should be inserted to maintain order.
`equals`(other)	Return if another array is equivalent to this array.
`isin`(values)	Pointwise comparison for set containment in the given values.
`factorize`([na_sentinel])	Encode the extension array as an enumerated type.
`repeat`(repeats[, axis])	Repeat elements of a ExtensionArray.
`take`(indices[, allow_fill, fill_value])	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`copy`()	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`view`([dtype])	Return a view on the array.
`__repr__`()	See docstring in `ExtensionArray` class in `pandas/core/arrays/base.py` for information about this method.
`transpose`(*axes)	Return a transposed view on this array.
`ravel`([order])	Return a flattened view on this array.
`tolist`()	Return a list of the values.
`delete`(loc)
`insert`(loc, item)	Insert an item at the given position.
`__array_ufunc__`(ufunc, method, *inputs, **kwargs)	Interface to handle numpy ufuncs that will accept TensorArray as input, and wrap the output back as another TensorArray.

all(axis=None, out=None, keepdims=False)[source]¶

Test whether all array elements along a given axis evaluate to True.

Parameters

axis – Axis or axes along which a logical AND reduction is performed.
out – Alternate output array in which to place the result.
keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.

Returns

A single boolean if axis is None; otherwise a TensorArray.

any(axis=None, out=None, keepdims=False)[source]¶

Test whether any array element along a given axis evaluates to True.

See the numpy.any() documentation for more information: https://numpy.org/doc/stable/reference/generated/numpy.any.html#numpy.any

Parameters

axis – Axis or axes along which a logical OR reduction is performed.
out – Alternate output array in which to place the result.
keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.

Returns

A single boolean if axis is None; otherwise a TensorArray.
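The `axis` and `keepdims` parameters of `all()` and `any()` are easiest to see on a small array. The sketch below uses `numpy.all` and `numpy.any` directly on a plain ndarray, on the assumption that the TensorArray methods follow the same reduction semantics on the backing array; it does not call the library itself.

```python
import numpy as np

# Three rows of two boolean values each.
backing = np.array([
    [True, False],
    [True, True],
    [False, False],
])

# axis=None reduces over every element and yields a single boolean.
all_overall = np.all(backing)
any_overall = np.any(backing)

# axis=1 reduces within each row and yields one result per row.
all_per_row = np.all(backing, axis=1)
any_per_row = np.any(backing, axis=1)

# keepdims=True leaves the reduced axis in place with size one.
any_per_row_kept = np.any(backing, axis=1, keepdims=True)

print(all_overall, any_overall)
print(all_per_row, any_per_row)
print(any_per_row_kept.shape)
```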

astype(dtype, copy=True)[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

copy() → text_extensions_for_pandas.array.tensor.TensorArray[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property dtype: pandas.core.dtypes.base.ExtensionDtype¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property inferred_type: str¶

Return string describing type of TensorArray. Delegates to pandas.api.types.infer_dtype(). See docstring for more information.

Returns: string describing numpy type of this TensorArray

isna() → numpy.array[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property nbytes: int¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

property numpy_dtype¶

Get the dtype of the tensor.

Returns: The numpy dtype of the backing ndarray

property numpy_ndim¶

Get the number of tensor dimensions.

Returns: integer for the number of dimensions

property numpy_shape¶

Get the shape of the tensor.

Returns: A tuple of integers for the numpy shape of the backing ndarray

take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) → text_extensions_for_pandas.array.tensor.TensorArray[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
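The effect of `allow_fill` and `fill_value` is defined by the Pandas `ExtensionArray.take` contract. As a sketch of that contract, the example below uses the public helper `pandas.api.extensions.take` on a plain NumPy array; it assumes `TensorArray.take` follows the same rules and does not exercise the TensorArray class.

```python
import numpy as np
from pandas.api.extensions import take

values = np.array([10.0, 20.0, 30.0])
indices = np.array([0, -1])

# allow_fill=False (the default): -1 counts from the end, as in NumPy.
from_end = take(values, indices)

# allow_fill=True: -1 marks a missing slot, filled with NaN for float data.
filled = take(values, indices, allow_fill=True)

# An explicit fill_value replaces the default missing-value marker.
custom = take(values, indices, allow_fill=True, fill_value=-1.0)

print(from_end)
print(filled)
print(custom)
```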

to_numpy(dtype=None, copy=False, na_value=NoDefault.no_default)[source]¶: See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.

TensorElement Class: Object to represent a single tensor¶

class text_extensions_for_pandas.TensorElement(values: numpy.ndarray)[source]¶

Class representing a single element in a TensorArray, or row in a Pandas column of dtype TensorDtype. This is a light wrapper over a numpy.ndarray.

Public Methods:

`__init__`(values)	Construct a TensorElement from a numpy.ndarray.
`__repr__`()	Return repr(self).
`__str__`()	Return str(self).
`to_numpy`()	Return the values of this element as a numpy.ndarray
`__array__`()
`__add__`(other)
`__radd__`(other)
`__sub__`(other)
`__rsub__`(other)
`__mul__`(other)
`__rmul__`(other)
`__pow__`(other)
`__rpow__`(other)
`__mod__`(other)
`__rmod__`(other)
`__floordiv__`(other)
`__rfloordiv__`(other)
`__truediv__`(other)
`__rtruediv__`(other)
`__divmod__`(other)
`__rdivmod__`(other)
`__eq__`(other)	Return self==value.
`__ne__`(other)	Return self!=value.
`__lt__`(other)	Return self<value.
`__gt__`(other)	Return self>value.
`__le__`(other)	Return self<=value.
`__ge__`(other)	Return self>=value.

to_numpy()[source]¶

Return the values of this element as a numpy.ndarray

Returns: numpy.ndarray
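To show what a "light wrapper over a numpy.ndarray" can look like, here is a minimal sketch. `ElementSketch` is a hypothetical class written for illustration; it is not the library's TensorElement, and the real implementation may differ in how it handles operators and conversions.

```python
import numpy as np


class ElementSketch:
    """Hypothetical thin ndarray wrapper; NOT the library's TensorElement."""

    def __init__(self, values):
        self._values = np.asarray(values)

    def to_numpy(self):
        # Hand back the wrapped array.
        return self._values

    def __array__(self, dtype=None, copy=None):
        # Lets np.asarray(...) and other NumPy functions see the wrapped data.
        if dtype is None:
            return self._values
        return self._values.astype(dtype)

    def __add__(self, other):
        # Unwrap the other operand if needed, then re-wrap the result.
        other_values = other.to_numpy() if isinstance(other, ElementSketch) else other
        return ElementSketch(self._values + other_values)


elem = ElementSketch([[1, 2], [3, 4]])
total = elem + 10               # arithmetic delegates to the ndarray
as_array = np.asarray(elem)     # conversion goes through __array__

print(total.to_numpy())
print(as_array.shape)
```

Delegating each operator to the wrapped array keeps the wrapper small while still letting a single row behave like a tensor in arithmetic expressions.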