Pandas Extension Types¶
Text Extensions for Pandas includes extension types for representing spans and tensors inside Pandas DataFrames. This section describes the Python classes that implement these types.
Span Extension Type¶
The SpanDtype
extension data type efficiently stores span data
in a Pandas Series.
Each span is represented by begin and end character offsets
into a target document.
We use dense NumPy arrays for efficient internal storage.
- class text_extensions_for_pandas.SpanDtype[source]¶
Pandas datatype for a span that represents a range of characters within a target string.
SpanArray Class: Store spans in a Pandas Series¶
- class text_extensions_for_pandas.SpanArray(text: Union[str, Sequence[str], numpy.ndarray, Tuple[text_extensions_for_pandas.array.string_table.StringTable, numpy.ndarray]], begins: Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]], ends: Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]])[source]¶
A Pandas ExtensionArray that represents a column of character-based spans over a single target text.
Spans are represented as [begin, end) intervals, where begin and end are character offsets into the target text.
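As a rough illustration of the half-open [begin, end) convention, the following plain-Python sketch slices a target text with begin and end offsets. It is not the library's implementation, and `covered_substrings` is a hypothetical helper name. Python slices are also half-open, so the text covered by a span is simply `text[begin:end]`.

```python
from typing import List, Sequence


def covered_substrings(text: str, begins: Sequence[int], ends: Sequence[int]) -> List[str]:
    """Return the substring of `text` covered by each [begin, end) span."""
    if len(begins) != len(ends):
        raise ValueError("begins and ends must have the same length")
    # The end offset is exclusive, exactly like a Python slice.
    return [text[b:e] for b, e in zip(begins, ends)]


text = "Hello world"
# Two spans: [0, 5) covers "Hello" and [6, 11) covers "world".
print(covered_substrings(text, [0, 6], [5, 11]))
```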
Public Data Attributes:
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
target_text
"Document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
document_text
If all spans in this array cover the same document, the text of that document.
is_single_document
True if there is at least one span in the array and every span is over the same target text.
begin
end
version
Monotonically increasing version number that changes every time this array is modified.
covered_text
An array of the substrings of target_text corresponding to the spans in this array.
normalized_covered_text
A normalized version of the covered text of the spans in this array.
repr_html_show_offsets
Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.
Inherited from
ExtensionArray
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
shape
Return a tuple of the array dimensions.
size
The number of elements in the array.
ndim
Extension Arrays are only allowed to be 1-dimensional.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
T
Public Methods:
__init__
(text, begins, ends)Factory method for creating instances of this class.
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__len__
()Length of this array
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__eq__
(other)Pandas/Numpy-style array/series comparison function.
__ne__
(other)Return self != other (element-wise inequality).
__hash__
()Return hash(self).
__contains__
(item)Return true if scalar item exists in this SpanArray.
equals
(other)- param other
A second
SpanArray
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__lt__
(other)Pandas-style array/series comparison function.
__gt__
(other)Return self>value.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.
make_array
(o)Make a SpanArray object out of any of several types of input.
split_by_document
()Return a list of slices of this SpanArray that cover single documents.
increment_version
()Manually increase the version counter of this array to indicate that the array's contents have changed.
Returns (begin, end) pairs as an array of tuples.
as_frame
()Returns a dataframe representation of this column based on Python atomic types.
same_target_text
(other)- param other
Either a single span or an array of spans of the same
overlaps
(other)- param other
Either a single span or an array of spans of the same
contains
(other)- param other
Either a single span or an array of spans of the same
__arrow_array__
([type])Conversion of this Array to a pyarrow.ExtensionArray.
Inherited from
ExtensionArray
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__len__
()Length of this array
__iter__
()Iterate over elements of the array.
__contains__
(item)Return true if scalar item exists in this SpanArray.
__eq__
(other)Pandas/Numpy-style array/series comparison function.
__ne__
(other)Return self != other (element-wise inequality).
to_numpy
([dtype, copy, na_value])Convert to a NumPy ndarray.
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
argsort
([ascending, kind, na_position])Return the indices that would sort this array.
argmin
([skipna])Return the index of minimum value.
argmax
([skipna])Return the index of maximum value.
fillna
([value, method, limit])Fill NA/NaN values using the specified method.
dropna
()Return ExtensionArray without NA values.
shift
([periods, fill_value])Shift values by desired number.
unique
()Compute the ExtensionArray of unique values.
searchsorted
(value[, side, sorter])Find indices where elements should be inserted to maintain order.
equals
(other)- param other
A second
SpanArray
isin
(values)Pointwise comparison for set containment in the given values.
factorize
([na_sentinel])Encode the extension array as an enumerated type.
repeat
(repeats[, axis])Repeat elements of an ExtensionArray.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
view
([dtype])Return a view on the array.
__repr__
()Return repr(self).
transpose
(*axes)Return a transposed view on this array.
ravel
([order])Return a flattened view on this array.
delete
(loc)Inherited from
SpanOpMixin
__add__
(other)Add a pair of spans and/or span arrays.
- as_frame() pandas.core.frame.DataFrame [source]¶
Returns a dataframe representation of this column based on Python atomic types.
- astype(dtype, copy=True)[source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- contains(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶
- Parameters
other – Either a single span or an array of spans of the same length as this one
- Returns
Numpy array containing a boolean mask of all entries that contain the corresponding element of other
- copy() text_extensions_for_pandas.array.span.SpanArray [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property covered_text: numpy.ndarray¶
An array of the substrings of target_text corresponding to the spans in this array.
- property document_text: Optional[str]¶
If all spans in this array cover the same document, the text of that document. Raises a ValueError if the array is empty or if the spans in this array cover more than one document.
- property dtype: pandas.core.dtypes.base.ExtensionDtype¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- equals(other: text_extensions_for_pandas.array.span.SpanArray)[source]¶
- Parameters
other – A second SpanArray
- Returns
True if both arrays have the same target texts (can be a different string object with the same contents) and the same spans in the same order.
- increment_version()[source]¶
Manually increase the version counter of this array to indicate that the array’s contents have changed. Also invalidates any internal cached data derived from the array’s state.
- property is_single_document: bool¶
True if there is at least one span in the array and every span is over the same target text.
- isna() numpy.array [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- classmethod make_array(o) text_extensions_for_pandas.array.span.SpanArray [source]¶
Make a SpanArray object out of any of several types of input.
- Parameters
o – a SpanArray object represented as a pd.Series, a list of Span objects, or an actual SpanArray (or TokenSpanArray) object.
- Returns
SpanArray version of o, which may be a pointer to o or one of its fields.
- property nbytes: int¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property normalized_covered_text: numpy.ndarray¶
A normalized version of the covered text of the spans in this array. Currently “normalized” means “lowercase”.
- overlaps(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶
- Parameters
other – Either a single span or an array of spans of the same length as this one
- Returns
Numpy array containing a boolean mask of all entries that overlap the corresponding element of other
- property repr_html_show_offsets¶
Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.
- same_target_text(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶
- Parameters
other – Either a single span or an array of spans of the same length as this one
- Returns
Numpy array containing a boolean mask of all entries that have the same target text. Two spans with target text of None are considered to have the same target text.
- split_by_document() List[text_extensions_for_pandas.array.span.SpanArray] [source]¶
- Returns
A list of slices of this SpanArray that cover single documents.
- take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) text_extensions_for_pandas.array.span.SpanArray [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property target_text: numpy.ndarray¶
“document” texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
- property version: int¶
Monotonically increasing version number that changes every time this array is modified. NOTE: This number might not change if a caller obtains a pointer to an internal array and modifies it. Callers who perform such modifications should call increment_version().
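The interaction between the version counter, cached data, and direct modification of internal arrays can be sketched with a toy class. This is only an illustration of the pattern described for version and increment_version(); `VersionedOffsets`, `set_span`, and `lengths` are hypothetical names, and the real SpanArray stores its data differently.

```python
class VersionedOffsets:
    """Toy container that tracks modifications with a version counter."""

    def __init__(self, begins, ends):
        self._begins = list(begins)
        self._ends = list(ends)
        self._version = 0
        self._cached_lengths = None  # derived data, rebuilt on demand

    @property
    def version(self) -> int:
        return self._version

    def increment_version(self) -> None:
        """Record a modification and drop any cached derived data."""
        self._version += 1
        self._cached_lengths = None

    def set_span(self, index: int, begin: int, end: int) -> None:
        # Modifications made through the public API bump the version.
        self._begins[index] = begin
        self._ends[index] = end
        self.increment_version()

    @property
    def lengths(self):
        if self._cached_lengths is None:
            self._cached_lengths = [e - b for b, e in zip(self._begins, self._ends)]
        return self._cached_lengths


arr = VersionedOffsets([0, 6], [5, 11])
print(arr.lengths, arr.version)

# Writing to an internal list directly bypasses the counter,
# so the cached lengths are now stale.
arr._ends[0] = 3
print(arr.lengths, arr.version)

# The caller signals the change explicitly, which clears the cache.
arr.increment_version()
print(arr.lengths, arr.version)
```

The design point is that the counter can only see changes made through the container's own methods, which is why callers that modify internal arrays are asked to increment the version themselves.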
Span Class: Object to represent a single span¶
- class text_extensions_for_pandas.Span(text: str, begin: int, end: int)[source]¶
Python object representation of a single span with character offsets; that is, a single row of a SpanArray.
An offset of Span.NULL_OFFSET_VALUE (currently -1) indicates “not a span” in the sense that NaN is “not a number”.
Most of the methods and properties of this class are single-span versions of the eponymous methods in SpanArray. See that class for API documentation.
Public Data Attributes:
NULL_OFFSET_VALUE
begin
end
target_text
covered_text
Returns the substring of self.target_text that this Span represents.
Public Methods:
__init__
(text, begin, end)- param text
target document text on which the span is defined
__repr__
()Return repr(self).
__eq__
(other)Return self==value.
__hash__
()Return hash(self).
__lt__
(other)span1 < span2 if span1.end <= span2.begin and both spans are over the same target text
__gt__
(other)Return self>value.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.
overlaps
(other)- param other
Another Span or TokenSpan
contains
(other)- param other
Another Span or TokenSpan
context
([num_chars])Show the location of this span in the context of the target string.
Inherited from
SpanOpMixin
__add__
(other)Add a pair of spans and/or span arrays.
- contains(other: text_extensions_for_pandas.array.span.Span)[source]¶
- Parameters
other – Another Span or TokenSpan
- Returns
True if other is entirely within the bounds of this span. Also True if a zero-length span is contained within the other.
- context(num_chars: int = 40) str [source]¶
Show the location of this span in the context of the target string.
- Parameters
num_chars – How many characters on either side to display
- Returns
A string in the form:
`<text before>[<text inside>]<text after>`
describing the text within and around the span.
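A minimal sketch of how a string in this form could be assembled is shown below. It assumes plain slicing with no truncation markers, so the library's actual output may differ in detail; `span_context` is a hypothetical standalone function, not the method itself.

```python
def span_context(text: str, begin: int, end: int, num_chars: int = 40) -> str:
    """Format a span as '<text before>[<text inside>]<text after>'."""
    # Take at most num_chars characters on either side of the span.
    before = text[max(0, begin - num_chars):begin]
    inside = text[begin:end]
    after = text[end:end + num_chars]
    return f"{before}[{inside}]{after}"


# The span [4, 9) covers "quick"; show 4 characters on each side.
print(span_context("The quick brown fox", 4, 9, num_chars=4))
```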
- property covered_text¶
Returns the substring of self.target_text that this Span represents.
- overlaps(other: text_extensions_for_pandas.array.span.Span)[source]¶
- Parameters
other – Another Span or TokenSpan
- Returns
True if the two spans overlap. Also True if a zero-length span is contained within the other.
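The comparison rules documented for Span can be sketched on bare (begin, end) pairs. This is an interpretation of the descriptions in this section, not the library's code: it ignores target text and null spans, and the treatment of zero-length spans at the exact boundary of another span is an assumption. The names `overlaps`, `contains`, and `precedes` are standalone helpers introduced for illustration.

```python
from typing import Tuple

Offsets = Tuple[int, int]  # (begin, end), half-open


def contains(a: Offsets, b: Offsets) -> bool:
    """True if `b` lies entirely within the bounds of `a`."""
    return a[0] <= b[0] and b[1] <= a[1]


def overlaps(a: Offsets, b: Offsets) -> bool:
    """True if the intervals share a character, or if a zero-length
    interval falls within the bounds of the other."""
    if a[0] == a[1] or b[0] == b[1]:
        # Assumption: a zero-length span counts as overlapping when it
        # lies within the other span's bounds, endpoints included.
        return contains(a, b) or contains(b, a)
    return a[0] < b[1] and b[0] < a[1]


def precedes(a: Offsets, b: Offsets) -> bool:
    """Mirror of the documented `<` rule: a ends at or before b begins."""
    return a[1] <= b[0]


print(overlaps((0, 5), (3, 8)))   # the spans share characters 3 and 4
print(overlaps((0, 5), (5, 8)))   # adjacent spans share no character
print(contains((0, 10), (2, 4)))
print(precedes((0, 5), (5, 8)))
```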
Token-Based Span Extension Type¶
The TokenSpanDtype extension data type is similar to SpanDtype, except that it represents spans using
begin and end offsets into the tokens of a target document.
These tokens are stored in a (shared) SpanArray object.
- class text_extensions_for_pandas.TokenSpanDtype[source]¶
Pandas datatype for a span that represents a range of tokens within a target string.
TokenSpanArray Class: Store token-based spans in a Pandas Series¶
- class text_extensions_for_pandas.TokenSpanArray(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, Sequence[text_extensions_for_pandas.array.span.SpanArray]], begin_tokens: Optional[Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]]] = None, end_tokens: Optional[Union[pandas.core.series.Series, numpy.ndarray, Sequence[int]]] = None)[source]¶
A Pandas ExtensionArray that represents a column of token-based spans over a single target text.
Spans are represented internally as [begin_token, end_token) intervals, where the properties begin_token and end_token are token offsets into the target text. As with the parent class SpanArray, the properties begin and end of a TokenSpanArray return character offsets.
Null values are encoded with begin and end offsets of TokenSpan.NULL_OFFSET_VALUE.
Fields:
self._tokens: Reference to the target string’s tokens as a SpanArray. For now, references to different SpanArray objects are treated as different even if the arrays have the same contents.
self._begin_tokens: Numpy array of integer offsets in tokens. An offset of TokenSpan.NULL_OFFSET_VALUE here indicates a null value.
self._end_tokens: Numpy array of end offsets (1 + last token in span).
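The relationship between token offsets and character offsets can be sketched in plain Python. This is a hedged illustration rather than the library's implementation: tokens are modeled as a list of (begin, end) character pairs, `token_to_char_offsets` is a hypothetical helper, and empty token spans are deliberately left out of scope.

```python
from typing import List, Tuple

NULL_OFFSET_VALUE = -1  # sentinel for "not a span", as described above


def token_to_char_offsets(
    token_spans: List[Tuple[int, int]],
    begin_token: int,
    end_token: int,
) -> Tuple[int, int]:
    """Map a [begin_token, end_token) span to (begin, end) character offsets."""
    if begin_token == NULL_OFFSET_VALUE:
        return (NULL_OFFSET_VALUE, NULL_OFFSET_VALUE)
    if not 0 <= begin_token < end_token <= len(token_spans):
        raise ValueError("this sketch only handles non-empty, in-range token spans")
    begin = token_spans[begin_token][0]   # start of the first token
    end = token_spans[end_token - 1][1]   # end of the last token (end_token is exclusive)
    return (begin, end)


text = "The quick brown fox"
tokens = [(0, 3), (4, 9), (10, 15), (16, 19)]

# Token span [1, 3) covers the tokens "quick" and "brown".
begin, end = token_to_char_offsets(tokens, 1, 3)
print(begin, end, repr(text[begin:end]))
```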
Public Data Attributes:
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
tokens
The tokens over which each TokenSpan in this array is defined, as an ndarray of objects.
target_text
"Document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
document_text
If all spans in this array cover the same document, the text of that document.
document_tokens
If all spans in this array cover the same tokenization of a single document, the tokens of that document.
nulls_mask
A boolean mask indicating which rows are nulls.
begin
The character offsets of the span begins.
end
The character offsets of the span ends.
begin_token
Token offsets of the span begins; that is, the index of the first token in each span.
end_token
Token offsets of the span ends.
covered_text
Returns an array of the substrings of target_text corresponding to the spans in this array.
is_single_document
True if every span in this array is over the same target text or if there are zero spans in this array.
is_single_tokenization
True if every span in this array is over the same tokenization of the same target text or if there are zero spans in this array.
Inherited from
SpanArray
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
target_text
"Document" texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
document_text
If all spans in this array cover the same document, the text of that document.
is_single_document
True if every span in this array is over the same target text or if there are zero spans in this array.
begin
The character offsets of the span begins.
end
The character offsets of the span ends.
version
Monotonically increasing version number that changes every time this array is modified.
covered_text
Returns an array of the substrings of target_text corresponding to the spans in this array.
normalized_covered_text
A normalized version of the covered text of the spans in this array.
repr_html_show_offsets
Whether the HTML/Jupyter notebook representation of this array will contain a table of span offsets in addition to the marked-up target text.
Inherited from
ExtensionArray
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
shape
Return a tuple of the array dimensions.
size
The number of elements in the array.
ndim
Extension Arrays are only allowed to be 1-dimensional.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
T
Public Methods:
__init__
(tokens[, begin_tokens, end_tokens])- param tokens
Character-level span information about the underlying
from_char_offsets
(tokens)Convenience factory method for wrapping the character-level spans of a series of tokens into single-token token-based spans.
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__len__
()Length of this array
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__eq__
(other)Pandas/Numpy-style array/series comparison function.
__hash__
()Return hash(self).
__contains__
(item)Return true if scalar item exists in this TokenSpanArray.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
make_array
(o)Make a TokenSpanArray object out of any of several types of input.
align_to_tokens
(tokens, spans)Align a set of character or token-based spans to a specified tokenization, producing a TokenSpanArray of token-based spans.
Returns (begin, end) pairs as an array of tuples.
increment_version
()Override parent class's version of this function to also clear out data cached in the subclass.
as_frame
()Returns a dataframe representation of this column based on Python atomic types.
same_target_text
(other)- param other
Either a single span or an array of spans of the same
same_tokens
(other)- param other
Either a single span or an array of spans of the same
split_by_document
()Return a list of slices of this SpanArray that cover single documents.
__arrow_array__
([type])Conversion of this Array to a pyarrow.ExtensionArray.
Inherited from
SpanArray
__init__
(tokens[, begin_tokens, end_tokens])- param tokens
Character-level span information about the underlying
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__len__
()Length of this array
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__eq__
(other)Pandas/Numpy-style array/series comparison function.
__ne__
(other)Return self != other (element-wise inequality).
__hash__
()Return hash(self).
__contains__
(item)Return true if scalar item exists in this TokenSpanArray.
equals
(other)- param other
A second
SpanArray
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__lt__
(other)Pandas-style array/series comparison function.
__gt__
(other)Return self>value.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.
make_array
(o)Make a TokenSpanArray object out of any of several types of input.
split_by_document
()Return a list of slices of this SpanArray that cover single documents.
increment_version
()Override parent class's version of this function to also clear out data cached in the subclass.
Returns (begin, end) pairs as an array of tuples.
as_frame
()Returns a dataframe representation of this column based on Python atomic types.
same_target_text
(other)- param other
Either a single span or an array of spans of the same
overlaps
(other)- param other
Either a single span or an array of spans of the same
contains
(other)- param other
Either a single span or an array of spans of the same
__arrow_array__
([type])Conversion of this Array to a pyarrow.ExtensionArray.
Inherited from
ExtensionArray
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__len__
()Length of this array
__iter__
()Iterate over elements of the array.
__contains__
(item)Return true if scalar item exists in this TokenSpanArray.
__eq__
(other)Pandas/Numpy-style array/series comparison function.
__ne__
(other)Return self != other (element-wise inequality).
to_numpy
([dtype, copy, na_value])Convert to a NumPy ndarray.
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
argsort
([ascending, kind, na_position])Return the indices that would sort this array.
argmin
([skipna])Return the index of minimum value.
argmax
([skipna])Return the index of maximum value.
fillna
([value, method, limit])Fill NA/NaN values using the specified method.
dropna
()Return ExtensionArray without NA values.
shift
([periods, fill_value])Shift values by desired number.
unique
()Compute the ExtensionArray of unique values.
searchsorted
(value[, side, sorter])Find indices where elements should be inserted to maintain order.
equals
(other)- param other
A second
SpanArray
isin
(values)Pointwise comparison for set containment in the given values.
factorize
([na_sentinel])Encode the extension array as an enumerated type.
repeat
(repeats[, axis])Repeat elements of an ExtensionArray.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
view
([dtype])Return a view on the array.
__repr__
()Return repr(self).
transpose
(*axes)Return a transposed view on this array.
ravel
([order])Return a flattened view on this array.
delete
(loc)Inherited from
TokenSpanOpMixin
__add__
(other)Add a pair of spans and/or span arrays.
Inherited from
SpanOpMixin
__add__
(other)Add a pair of spans and/or span arrays.
- classmethod align_to_tokens(tokens: Any, spans: Any)[source]¶
Align a set of character or token-based spans to a specified tokenization, producing a TokenSpanArray of token-based spans.
- Parameters
tokens – The tokens to align to, as any type that SpanArray.make_array() accepts.
spans – The spans to align. These spans must all target the same text as tokens.
- Returns
An array of TokenSpan objects aligned to the tokens of tokens. Raises ValueError if any of the spans in spans doesn’t start and end on a token boundary.
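The alignment step can be illustrated with a small plain-Python sketch that works on (begin, end) character pairs. It is an assumption-laden stand-in for the real method: `align_to_token_offsets` is a hypothetical name, tokens are assumed to be non-empty and non-overlapping, and the result is a list of (begin_token, end_token) pairs rather than a TokenSpanArray.

```python
from typing import List, Tuple


def align_to_token_offsets(
    token_spans: List[Tuple[int, int]],
    char_spans: List[Tuple[int, int]],
) -> List[Tuple[int, int]]:
    """Convert character-based spans to [begin_token, end_token) pairs."""
    # Index the token boundaries so each lookup is a dictionary access.
    token_at_begin = {b: i for i, (b, _) in enumerate(token_spans)}
    token_at_end = {e: i for i, (_, e) in enumerate(token_spans)}

    aligned = []
    for begin, end in char_spans:
        if end < begin or begin not in token_at_begin or end not in token_at_end:
            raise ValueError(
                f"Span [{begin}, {end}) does not start and end on a token boundary"
            )
        # end_token is one past the last token in the span.
        aligned.append((token_at_begin[begin], token_at_end[end] + 1))
    return aligned


tokens = [(0, 3), (4, 9), (10, 15), (16, 19)]  # "The quick brown fox"
print(align_to_token_offsets(tokens, [(4, 15), (0, 3)]))
```

A span such as (5, 9), which starts in the middle of a token, raises ValueError in this sketch, matching the behavior described for the Returns value above.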
- as_frame() pandas.core.frame.DataFrame [source]¶
Returns a dataframe representation of this column based on Python atomic types.
- astype(dtype, copy=True)[source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property begin: numpy.ndarray¶
The character offsets of the span begins.
- property begin_token: numpy.ndarray¶
Token offsets of the span begins; that is, the index of the first token in each span.
- copy() text_extensions_for_pandas.array.token_span.TokenSpanArray [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property covered_text: numpy.ndarray¶
Returns an array of the substrings of target_text corresponding to the spans in this array.
- property document_text: Optional[str]¶
If all spans in this array cover the same document, the text of that document. Raises a ValueError if the array is empty or if the spans in this array cover more than one document.
- property document_tokens: Optional[text_extensions_for_pandas.array.span.SpanArray]¶
If all spans in this array cover the same tokenization of a single document, the tokens of that document. Raises a ValueError if the array is empty or if the spans in this array cover more than one document.
- property dtype: pandas.core.dtypes.base.ExtensionDtype¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property end: numpy.ndarray¶
The character offsets of the span ends.
- property end_token: numpy.ndarray¶
Token offsets of the span ends. That is, 1 + last token present in the span, for each span in the column.
- static from_char_offsets(tokens: Any) text_extensions_for_pandas.array.token_span.TokenSpanArray [source]¶
Convenience factory method for wrapping the character-level spans of a series of tokens into single-token token-based spans.
- Parameters
tokens – character-based offsets of the tokens, as any type that SpanArray.make_array() understands.
- Returns
A TokenSpanArray containing single-token spans for each of the tokens in tokens.
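In terms of token offsets, wrapping each token in its own single-token span means that span i begins at token i and ends at token i + 1. The sketch below shows only that offset arithmetic; `single_token_spans` is a hypothetical helper, and the real factory method returns a TokenSpanArray rather than two lists.

```python
from typing import List, Tuple


def single_token_spans(num_tokens: int) -> Tuple[List[int], List[int]]:
    """Return (begin_tokens, end_tokens) with one single-token span per token."""
    begin_tokens = list(range(num_tokens))
    # End offsets are exclusive, so each span ends one token after it begins.
    end_tokens = [i + 1 for i in begin_tokens]
    return begin_tokens, end_tokens


begin_tokens, end_tokens = single_token_spans(4)
print(begin_tokens, end_tokens)
```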
- increment_version()[source]¶
Override parent class’s version of this function to also clear out data cached in the subclass.
- property is_single_document: bool¶
True if every span in this array is over the same target text or if there are zero spans in this array.
- property is_single_tokenization: bool¶
True if every span in this array is over the same tokenization of the same target text or if there are zero spans in this array.
- isna() numpy.array [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- classmethod make_array(o) text_extensions_for_pandas.array.token_span.TokenSpanArray [source]¶
Make a TokenSpanArray object out of any of several types of input.
- Parameters
o – a TokenSpanArray object represented as a pd.Series, a list of TokenSpan objects, or an actual TokenSpanArray object.
- Returns
TokenSpanArray version of o, which may be a pointer to o or one of its fields.
- property nbytes: int¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property nulls_mask: numpy.ndarray¶
A boolean mask indicating which rows are nulls.
- same_target_text(other: Union[text_extensions_for_pandas.array.span.SpanArray, text_extensions_for_pandas.array.span.Span])[source]¶
- Parameters
other – Either a single span or an array of spans of the same length as this one
- Returns
Numpy array containing a boolean mask of all entries that have the same target text. Two spans with target text of None are considered to have the same target text.
- same_tokens(other: Union[text_extensions_for_pandas.array.token_span.TokenSpanArray, text_extensions_for_pandas.array.token_span.TokenSpan])[source]¶
- Parameters
other – Either a single span or an array of spans of the same length as this one. Must be token-based.
- Returns
Numpy array containing a boolean mask of all entries that are over the same tokenization of the same target text. Two spans with target text of None are considered to have the same target text.
- split_by_document() List[text_extensions_for_pandas.array.span.SpanArray] [source]¶
- Returns
A list of slices of this SpanArray that cover single documents.
- take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) text_extensions_for_pandas.array.token_span.TokenSpanArray [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property target_text: numpy.ndarray¶
“document” texts that the spans in this array reference, as opposed to the regions of these documents that the spans cover.
- property tokens: numpy.ndarray¶
The tokens over which each TokenSpan in this array is defined, as an ndarray of objects.
TokenSpan Class: Object to represent a single token-based span¶
- class text_extensions_for_pandas.TokenSpan(tokens: Any, begin_token: int, end_token: int)[source]¶
Python object representation of a single span with token offsets; that is, a single row of a TokenSpanArray.
This class is also a subclass of Span and can return character-level information.
An offset of TokenSpan.NULL_OFFSET_VALUE (currently -1) indicates “not a span” in the sense that NaN is “not a number”.
Public Data Attributes:
USE_TOKEN_OFFSETS_IN_REPR
tokens
begin_token
end_token
Inherited from
Span
NULL_OFFSET_VALUE
begin
end
target_text
covered_text
Returns the substring of self.target_text that this Span represents.
Public Methods:
__init__
(tokens, begin_token, end_token)- param tokens
Tokenization information about the document, including
make_null
(tokens)Convenience method for building null spans.
__repr__
()Return repr(self).
__eq__
(other)Return self==value.
__hash__
()Return hash(self).
__lt__
(other)span1 < span2 if span1.end <= span2.begin
Inherited from
Span
__init__
(tokens, begin_token, end_token)- param tokens
Tokenization information about the document, including
__repr__
()Return repr(self).
__eq__
(other)Return self==value.
__hash__
()Return hash(self).
__lt__
(other)span1 < span2 if span1.end <= span2.begin
__gt__
(other)Return self>value.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.
overlaps
(other)- param other
Another Span or TokenSpan
contains
(other)- param other
Another Span or TokenSpan
context
([num_chars])Show the location of this span in the context of the target string.
Inherited from
TokenSpanOpMixin
__add__
(other)Add a pair of spans and/or span arrays.
Inherited from
SpanOpMixin
__add__
(other)Add a pair of spans and/or span arrays.
Tensor Extension Type¶
The TensorDtype
extension data type efficiently stores
tensors in the rows of a Pandas Series.
For efficiency, we store all of the tensors in a Series in a single
NumPy array.
- class text_extensions_for_pandas.TensorDtype[source]¶
Pandas data type for a column of tensors with the same shape.
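Because every tensor in a column has the same shape, the whole column can live in one array whose leading dimension indexes the rows. The following sketch checks that constraint using nested Python lists in place of NumPy arrays; it is an illustration only, `nested_shape` and `column_shape` are hypothetical helpers, and the nested lists are assumed to be rectangular.

```python
def nested_shape(value) -> tuple:
    """Shape of a rectangular nested list; a bare number has shape ()."""
    shape = []
    while isinstance(value, (list, tuple)):
        shape.append(len(value))
        if len(value) == 0:
            break
        value = value[0]  # assumes every sibling has the same length
    return tuple(shape)


def column_shape(tensors) -> tuple:
    """Shape of the single array that would hold every tensor in a column."""
    shapes = {nested_shape(t) for t in tensors}
    if len(shapes) != 1:
        raise ValueError(f"All tensors must have the same shape, got {sorted(shapes)}")
    (element_shape,) = shapes
    # The outer dimension counts the tensors in the column.
    return (len(tensors),) + element_shape


column = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
print(column_shape(column))
```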
TensorArray Class: Store tensors in a Pandas Series¶
- class text_extensions_for_pandas.TensorArray(values: Union[numpy.ndarray, Sequence[Union[numpy.ndarray, text_extensions_for_pandas.array.tensor.TensorElement]], text_extensions_for_pandas.array.tensor.TensorElement, Any])[source]¶
A Pandas ExtensionArray that represents a column of numpy.ndarray objects, or tensors, where the outer dimension is the count of tensors in the column. Each tensor must have the same shape.
Public Data Attributes:
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
inferred_type
Return string describing type of TensorArray.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
numpy_dtype
Get the dtype of the tensor.
numpy_ndim
Get the number of tensor dimensions.
numpy_shape
Get the shape of the tensor.
Inherited from
ExtensionArray
dtype
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
shape
Return a tuple of the array dimensions.
size
The number of elements in the array.
ndim
Extension Arrays are only allowed to be 1-dimensional.
nbytes
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
T
Public Methods:
__init__
(values)- param values
A numpy.ndarray or sequence of
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
to_numpy
([dtype, copy, na_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
any
([axis, out, keepdims])Test whether any array element along a given axis evaluates to True.
all
([axis, out, keepdims])Test whether all array elements along a given axis evaluate to True.
__len__
()Length of this array
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__contains__
(item)Return for item in self.
__repr__
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__str__
()Return str(self).
__array__
([dtype])Interface to return the backing tensor as a numpy array with optional dtype.
__array_ufunc__
(ufunc, method, *inputs, **kwargs)Interface to handle numpy ufuncs that will accept TensorArray as input, and wrap the output back as another TensorArray.
__arrow_array__
([type])
__add__
(other)
__radd__
(other)
__sub__
(other)
__rsub__
(other)
__mul__
(other)
__rmul__
(other)
__pow__
(other)
__rpow__
(other)
__mod__
(other)
__rmod__
(other)
__floordiv__
(other)
__rfloordiv__
(other)
__truediv__
(other)
__rtruediv__
(other)
__divmod__
(other)
__rdivmod__
(other)
__eq__
(other)Return for self == other (element-wise equality).
__ne__
(other)Return for self != other (element-wise in-equality).
__lt__
(other)Return self<value.
__gt__
(other)Return self>value.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.
Inherited from
ExtensionArray
__getitem__
(item)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__setitem__
(key, value)See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
__len__
()Length of this array
__iter__
()Iterate over elements of the array.
__contains__
(item)Return for item in self.
__eq__
(other)Return for self == other (element-wise equality).
__ne__
(other)Return for self != other (element-wise in-equality).
to_numpy
([dtype, copy, na_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
astype
(dtype[, copy])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
isna
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
argsort
([ascending, kind, na_position])Return the indices that would sort this array.
argmin
([skipna])Return the index of minimum value.
argmax
([skipna])Return the index of maximum value.
fillna
([value, method, limit])Fill NA/NaN values using the specified method.
dropna
()Return ExtensionArray without NA values.
shift
([periods, fill_value])Shift values by desired number.
unique
()Compute the ExtensionArray of unique values.
searchsorted
(value[, side, sorter])Find indices where elements should be inserted to maintain order.
equals
(other)Return if another array is equivalent to this array.
isin
(values)Pointwise comparison for set containment in the given values.
factorize
([na_sentinel])Encode the extension array as an enumerated type.
repeat
(repeats[, axis])Repeat elements of an ExtensionArray.
take
(indices[, allow_fill, fill_value])See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
copy
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
view
([dtype])Return a view on the array.
__repr__
()See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
transpose
(*axes)Return a transposed view on this array.
ravel
([order])Return a flattened view on this array.
delete
(loc)
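The __array__ and __array_ufunc__ entries in the method list above describe returning the backing tensor as a NumPy array and wrapping ufunc output back into the array type. The following is a minimal sketch of that NumPy protocol using a hypothetical WrappedTensor class. It is not the TensorArray implementation; it ignores details such as the out argument and only shows the unwrap, apply, rewrap pattern.

```python
import numpy as np


class WrappedTensor:
    """Hypothetical wrapper used only to illustrate the protocol."""

    def __init__(self, values):
        self._tensor = np.asarray(values)

    def __array__(self, dtype=None, copy=None):
        # Hand the backing array to NumPy, optionally converting the dtype.
        return np.asarray(self._tensor, dtype=dtype)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Unwrap any WrappedTensor inputs to plain ndarrays.
        raw = tuple(
            x._tensor if isinstance(x, WrappedTensor) else x for x in inputs
        )
        result = getattr(ufunc, method)(*raw, **kwargs)
        # Wrap array results back into the wrapper type.
        if isinstance(result, np.ndarray):
            return WrappedTensor(result)
        return result


w = WrappedTensor([[1.0, 4.0], [9.0, 16.0]])
roots = np.sqrt(w)      # dispatched through __array_ufunc__
shifted = np.add(w, 1)  # mixed wrapper / scalar inputs
```

Because the wrapper defines __array_ufunc__, NumPy hands the ufunc call to the wrapper instead of converting it to an ndarray first, which is what lets the result keep the wrapper type.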
- all(axis=None, out=None, keepdims=False)[source]¶
Test whether all array elements along a given axis evaluate to True.
- Parameters
axis – Axis or axes along which a logical AND reduction is performed.
out – Alternate output array in which to place the result.
keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.
- Returns
single boolean unless axis is not None; else TensorArray
- any(axis=None, out=None, keepdims=False)[source]¶
Test whether any array element along a given axis evaluates to True.
See numpy.any() documentation for more information https://numpy.org/doc/stable/reference/generated/numpy.any.html#numpy.any
- Parameters
axis – Axis or axes along which a logical OR reduction is performed.
out – Alternate output array in which to place the result.
keepdims – If this is set to True, the axes which are reduced are left in the result as dimensions with size one.
- Returns
single boolean unless axis is not None; else TensorArray
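The axis and keepdims parameters of all() and any() follow the NumPy reductions of the same names. As a hedged illustration, the sketch below applies numpy.all and numpy.any directly to a plain ndarray laid out like a column of three 2x2 tensors; it does not call TensorArray itself, and the variable names are made up for the example.

```python
import numpy as np

# Stand-in for the backing array of a column of three 2x2 tensors.
backing = np.array([
    [[1, 1], [1, 1]],
    [[1, 0], [1, 1]],
    [[0, 0], [0, 0]],
])

# axis=None reduces over every element and yields a single boolean.
all_overall = bool(np.all(backing))

# Reducing over the inner axes gives one boolean per tensor.
all_per_tensor = np.all(backing, axis=(1, 2))
any_per_tensor = np.any(backing, axis=(1, 2))

# keepdims=True leaves the reduced axis in place with size one.
kept = np.any(backing, axis=0, keepdims=True)
```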
- astype(dtype, copy=True)[source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- copy() text_extensions_for_pandas.array.tensor.TensorArray [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property dtype: pandas.core.dtypes.base.ExtensionDtype¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property inferred_type: str¶
Return string describing type of TensorArray. Delegates to pandas.api.types.infer_dtype(). See docstring for more information.
- Returns
string describing numpy type of this TensorArray
- isna() numpy.array [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property nbytes: int¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
- property numpy_dtype¶
Get the dtype of the tensor.
- Returns
The numpy dtype of the backing ndarray
- property numpy_ndim¶
Get the number of tensor dimensions.
- Returns
integer for the number of dimensions
- property numpy_shape¶
Get the shape of the tensor.
- Returns
A tuple of integers for the numpy shape of the backing ndarray
- take(indices: Sequence[int], allow_fill: bool = False, fill_value: Optional[Any] = None) text_extensions_for_pandas.array.tensor.TensorArray [source]¶
See docstring in ExtensionArray class in pandas/core/arrays/base.py for information about this method.
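Since take() defers to the ExtensionArray docstring, the meaning of allow_fill and fill_value may be easier to see in a small example. The sketch below uses the pandas helper pandas.api.extensions.take on a plain NumPy array rather than on a TensorArray, on the assumption that TensorArray.take follows the same ExtensionArray contract.

```python
import numpy as np
from pandas.api.extensions import take

values = np.array([10, 20, 30])

# allow_fill=False (the default): negative indices count from the end.
from_end = take(values, [0, 0, -1])

# allow_fill=True: -1 marks a missing position, filled with NaN by default.
filled_nan = take(values, [0, 0, -1], allow_fill=True)

# A custom fill_value replaces the missing positions instead.
filled_custom = take(values, [0, 0, -1], allow_fill=True, fill_value=-10)
```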
TensorElement Class: Object to represent a single tensor¶
- class text_extensions_for_pandas.TensorElement(values: numpy.ndarray)[source]¶
Class representing a single element in a TensorArray, or row in a Pandas column of dtype TensorDtype. This is a light wrapper over a numpy.ndarray.
Public Methods:
__init__
(values)Construct a TensorElement from a numpy.ndarray.
__repr__
()Return repr(self).
__str__
()Return str(self).
to_numpy
()Return the values of this element as a numpy.ndarray
__array__
()
__add__
(other)
__radd__
(other)
__sub__
(other)
__rsub__
(other)
__mul__
(other)
__rmul__
(other)
__pow__
(other)
__rpow__
(other)
__mod__
(other)
__rmod__
(other)
__floordiv__
(other)
__rfloordiv__
(other)
__truediv__
(other)
__rtruediv__
(other)
__divmod__
(other)
__rdivmod__
(other)
__eq__
(other)Return self==value.
__ne__
(other)Return self!=value.
__lt__
(other)Return self<value.
__gt__
(other)Return self>value.
__le__
(other)Return self<=value.
__ge__
(other)Return self>=value.