Spanner Algebra¶
The spanner module of Text Extensions for Pandas provides span-specific operations for Pandas DataFrames, based on the Document Spanners formalism, also known as spanner algebra.
Spanner algebra is an extension of relational algebra with additional operations to cover NLP applications. See the paper [“Document Spanners: A Formal Approach to Information Extraction”](https://researcher.watson.ibm.com/researcher/files/us-fagin/jacm15.pdf) by Fagin et al. for more information.
- text_extensions_for_pandas.spanner.adjacent_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second', min_gap: int = 0, max_gap: int = 0)[source]¶
Compute the join of two series of spans, where a pair of spans is considered to match if they are adjacent to each other in the text.
- Parameters
first_series – Spans that appear earlier. dtype must be TokenSpanDtype.
second_series – Spans that come after. dtype must be TokenSpanDtype.
first_name – Name to give the column in the returned dataframe that is derived from first_series.
second_name – Column name for spans from second_series in the returned DataFrame.
min_gap – Minimum gap, in tokens, allowed between a matching pair of spans (inclusive).
max_gap – Maximum gap, in tokens, allowed between a matching pair of spans (inclusive).
- Returns
a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
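For example, the following sketch pairs honorifics with the capitalized word that immediately follows them. The sample text, regexes, and variable names are illustrative, and it assumes that make_tokens() accepts a text and a tokenizer and that extract_regex_tok() returns token spans compatible with adjacent_join().

```python
import regex
import text_extensions_for_pandas as tp

text = "Mr. Smith met Ms. Jones in Boston."
tokenizer = tp.io.spacy.simple_tokenizer()
tokens = tp.io.spacy.make_tokens(text, tokenizer)

# Single-token matches for honorifics and for capitalized words
titles = tp.spanner.extract_regex_tok(tokens, regex.compile(r"(Mr|Ms)\.?"))["match"]
names = tp.spanner.extract_regex_tok(tokens, regex.compile(r"[A-Z][a-z]+"))["match"]

# Keep pairs where a name begins immediately after a title (zero-token gap)
pairs = tp.spanner.adjacent_join(titles, names,
                                 first_name="title", second_name="name")
```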
- text_extensions_for_pandas.spanner.consolidate(df: pandas.core.frame.DataFrame, on: str, how: str = 'left_to_right') pandas.core.frame.DataFrame [source]¶
Eliminate overlap among the spans in one column of a pd.DataFrame.
- Parameters
df – DataFrame containing spans and other attributes
on – Name of column in df on which to perform consolidation
how – What policy to use to decide which spans are considered to overlap and which of an overlapping pair will remain after consolidation. Available policies: left_to_right: Walk through the spans from left to right, keeping the longest non-overlapping match at each position encountered.
- Returns
the rows of df that remain after applying the specified policy to the spans in the column specified by on.
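A minimal sketch of the common pattern of consolidating overlapping regex matches; the text, regex, and the assumed make_tokens() signature are illustrative:

```python
import regex
import text_extensions_for_pandas as tp

text = "The New York Stock Exchange opened early."
tokenizer = tp.io.spacy.simple_tokenizer()
tokens = tp.io.spacy.make_tokens(text, tokenizer)

# Capitalized phrases of one to three tokens; shorter matches overlap longer ones
matches = tp.spanner.extract_regex_tok(
    tokens, regex.compile(r"([A-Z][a-z]+\s*)+"), min_len=1, max_len=3)

# Keep the longest non-overlapping match at each position (left_to_right policy)
longest = tp.spanner.consolidate(matches, on="match")
```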
- text_extensions_for_pandas.spanner.contain_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')[source]¶
Compute the join of two series of spans, where a pair of spans is considered to match if the second span is contained within the first.
- Parameters
first_series – First set of spans to join, wrapped in a pd.Series
second_series – Second set of spans to join. For pairs that satisfy the join predicate, these spans are contained within the corresponding spans of the first set.
first_name – Name to give the column in the returned dataframe that is derived from first_series.
second_name – Column name for spans from second_series in the returned DataFrame.
- Returns
a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
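A small self-contained sketch that pairs sentence spans with the name spans they contain; the text and regexes are illustrative only:

```python
import re
import pandas as pd
import text_extensions_for_pandas as tp

text = "Alice met Bob. Carol stayed home."
sentences = pd.Series(tp.spanner.extract_regex(text, re.compile(r"[^.]+\.")))
names = pd.Series(tp.spanner.extract_regex(text, re.compile(r"[A-Z][a-z]+")))

# Pair each sentence span with the name spans it contains
pairs = tp.spanner.contain_join(sentences, names,
                                first_name="sentence", second_name="name")
```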
- text_extensions_for_pandas.spanner.create_dict(entries: Iterable[str], tokenizer: spacy.tokenizer.Tokenizer = None) pandas.core.frame.DataFrame [source]¶
Create a dictionary from a list of entries, where each entry is expressed as a single string.
Tokenizes and normalizes the dictionary entries.
- Parameters
entries – Iterable of strings, one string per dictionary entry.
tokenizer – Preconfigured tokenizer object for tokenizing dictionary entries. Must always tokenize the same way as the tokenizer used on the target text! If None, this method will use the tokenizer returned by text_extensions_for_pandas.io.spacy.simple_tokenizer().
- Returns
pd.DataFrame with the normalized, tokenized dictionary entries.
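For example (the entry strings are illustrative):

```python
import text_extensions_for_pandas as tp

# With tokenizer=None, the entries are tokenized with the default tokenizer
# from tp.io.spacy.simple_tokenizer(); match against text tokenized the same way.
locations_dict = tp.spanner.create_dict(["New York", "San Francisco", "Boston"])
```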
- text_extensions_for_pandas.spanner.extract_dict(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], dictionary: pandas.core.frame.DataFrame, output_col_name: str = 'match')[source]¶
Identify all matches of a dictionary on a sequence of tokens.
- Parameters
tokens – SpanArray of token information, optionally wrapped in a pd.Series. These tokens must come from the same tokenizer that tokenized the entries of dictionary. To tokenize with SpaCy, use text_extensions_for_pandas.io.spacy.make_tokens().
dictionary – The dictionary to match, encoded as a pd.DataFrame in the format returned by load_dict().
output_col_name – (optional) name of column of matching spans in the returned DataFrame
- Returns
a single-column DataFrame of token ID spans of dictionary matches
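A minimal sketch, using create_dict() so that the dictionary and the target text share one tokenizer; the sample text and entries are illustrative:

```python
import text_extensions_for_pandas as tp

text = "She flew from Boston to San Francisco."
tokenizer = tp.io.spacy.simple_tokenizer()

# The dictionary and the target text must be tokenized by the same tokenizer.
tokens = tp.io.spacy.make_tokens(text, tokenizer)
locations_dict = tp.spanner.create_dict(["Boston", "San Francisco"], tokenizer)

matches = tp.spanner.extract_dict(tokens, locations_dict)
# One row per dictionary hit, in a column named "match"
```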
- text_extensions_for_pandas.spanner.extract_regex(doc_text: str, compiled_regex: re.Pattern)[source]¶
Identify all non-overlapping matches of a regular expression, as returned by re.Pattern.finditer(), and return those locations as an array of spans.
- Parameters
doc_text – Text of the document; will be the target text of the returned spans.
compiled_regex – Regular expression to evaluate, compiled with either the re or the regex package.
- Returns
SpanArray containing a span for each match of the regex.
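For example (the text and pattern are illustrative):

```python
import re
import text_extensions_for_pandas as tp

text = "Contact info@example.com or sales@example.com for details."
emails = tp.spanner.extract_regex(text, re.compile(r"[\w.+-]+@[\w.-]+"))
# `emails` is a SpanArray with one span per non-overlapping match
```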
- text_extensions_for_pandas.spanner.extract_regex_tok(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], compiled_regex: regex.regex.compile, min_len=1, max_len=1, output_col_name: str = 'match')[source]¶
Identify all (possibly overlapping) matches of a regular expression that start and end on token boundaries.
- Parameters
tokens – SpanArray of token information, optionally wrapped in a pd.Series.
compiled_regex – Regular expression to evaluate.
min_len – Minimum match length in tokens
max_len – Maximum match length (inclusive) in tokens
output_col_name – (optional) name of column of matching spans in the returned DataFrame
- Returns
A single-column DataFrame containing a span for each match of the regex.
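A hedged sketch; the text and pattern are illustrative, the third-party regex package is used to match the signature above, and make_tokens() is assumed to accept a text and a tokenizer:

```python
import regex
import text_extensions_for_pandas as tp

text = "Revenue rose 12 percent to 45 million."
tokenizer = tp.io.spacy.simple_tokenizer()
tokens = tp.io.spacy.make_tokens(text, tokenizer)

# One- or two-token matches that start and end on token boundaries, e.g.
# "12", "12 percent", "45", "45 million"; note that matches may overlap.
amounts = tp.spanner.extract_regex_tok(
    tokens, regex.compile(r"\d+(\s+\w+)?"), min_len=1, max_len=2)
```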
- text_extensions_for_pandas.spanner.extract_split(doc_text: str, split_points: Union[Sequence[int], numpy.ndarray, text_extensions_for_pandas.array.span.SpanArray]) text_extensions_for_pandas.array.span.SpanArray [source]¶
Split a document into spans along a specified set of split points.
- Parameters
doc_text – Text of the document; will be the target text of the returned spans.
split_points – A series of offsets into doc_text, expressed as either:
- A sequence of integers (split at certain locations and return a set of splits that covers every character in the document), as a list or 1-d Numpy array
- A sequence of spans (split around the indicated locations, but discard the parts of the document that are within a split point)
- Returns
SpanArray that splits the document in the specified way.
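A minimal sketch of the span-valued form of split_points, using extract_regex() to find the separators; the text and pattern are illustrative:

```python
import re
import text_extensions_for_pandas as tp

text = "alpha, beta, gamma"

# Split around the separator spans; the separators themselves are discarded,
# leaving spans for "alpha", "beta", and "gamma".
separators = tp.spanner.extract_regex(text, re.compile(r",\s*"))
fields = tp.spanner.extract_split(text, separators)
```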
- text_extensions_for_pandas.spanner.lemmatize(spans: Union[pandas.core.series.Series, text_extensions_for_pandas.array.span.SpanArray, Iterable[text_extensions_for_pandas.array.span.Span]], token_features: pandas.core.frame.DataFrame, lemma_col_name: str = 'lemma', token_span_col_name: str = 'span') List[str] [source]¶
Convert spans to their normal form using lemma information in a token features table.
- Parameters
spans – Spans to be normalized. Each may represent zero or more tokens.
token_features – DataFrame of token metadata. Index must be aligned with the token indices in spans.
lemma_col_name – Optional custom name for the DataFrame column containing the lemmatized form of each token.
token_span_col_name – Optional custom name for the DataFrame column containing the span of each token.
- Returns
A list containing normalized versions of the tokens in spans, with each token separated by a single space character.
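A hedged sketch; `token_features` and `entity_spans` are hypothetical inputs assumed to already exist (for example, a per-token DataFrame produced by a SpaCy-based featurizer and a set of entity spans over the same tokens):

```python
import text_extensions_for_pandas as tp

# `token_features`: one row per token, with a "span" column of token spans and
# a "lemma" column (hypothetical; e.g. produced by a SpaCy-based featurizer).
# `entity_spans`: the spans to normalize, each covering zero or more tokens.
normalized = tp.spanner.lemmatize(entity_spans, token_features)
# normalized[i] is the space-separated sequence of lemmas for entity_spans[i]
```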
- text_extensions_for_pandas.spanner.load_dict(file_name: str, tokenizer: spacy.tokenizer.Tokenizer = None)[source]¶
Load a SystemT-format dictionary file. File format is one entry per line.
Tokenizes and normalizes the dictionary entries.
- Parameters
file_name – Path to dictionary file
tokenizer – Preconfigured tokenizer object for tokenizing dictionary entries. Must be the same configuration as the tokenizer used on the target text! If None, this method will use SpaCy’s default English tokenizer.
- Returns
pd.DataFrame with the normalized, tokenized dictionary entries.
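For example (the file name is hypothetical):

```python
import text_extensions_for_pandas as tp

# "locations.dict" is a hypothetical SystemT-format file, one entry per line.
tokenizer = tp.io.spacy.simple_tokenizer()
locations_dict = tp.spanner.load_dict("locations.dict", tokenizer)
# Tokenize the target text with the same tokenizer before calling extract_dict().
```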
- text_extensions_for_pandas.spanner.overlap_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')[source]¶
Compute the join of two series of spans, where a pair of spans is considered to match if they overlap.
- Parameters
first_series – First set of spans to join, wrapped in a pd.Series
second_series – Second set of spans to join.
first_name – Name to give the column in the returned dataframe that is derived from first_series.
second_name – Column name for spans from second_series in the returned DataFrame.
- Returns
a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
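A small self-contained sketch; the text and the choice of fixed-width character windows are purely illustrative:

```python
import re
import pandas as pd
import text_extensions_for_pandas as tp

text = "The quick brown fox jumps."
words = pd.Series(tp.spanner.extract_regex(text, re.compile(r"\w+")))
# Fixed-width character windows that may cut across word boundaries
windows = pd.Series(tp.spanner.extract_regex(text, re.compile(r".{5}")))

# Pair each window with every word span it overlaps
pairs = tp.spanner.overlap_join(windows, words,
                                first_name="window", second_name="word")
```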
- text_extensions_for_pandas.spanner.unpack_semijoin(target_region: text_extensions_for_pandas.array.span.Span, model_results: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]¶
Unpack the results of evaluating an extraction model, such as dependency parsing or named entity recognition, using a semijoin strategy to reduce the amount of text over which the model is applied.
To use unpack_semijoin(), first identify the regions of the text over which you wish to run the model. Then run the model over the text of those regions to produce spans whose begin and end offsets are relative to the text of each distinct target region. Finally, pass each target region and its model results to this function to produce result spans whose begin and end offsets are relative to the original document text.
- Parameters
target_region – Span indicating a section of the original document text over which the model was applied.
model_results – Results from running your model over target_region, as a pd.DataFrame.
- Returns
A pd.DataFrame with the same schema as model_results, but with all spans converted from spans over the target text of target_region to spans over the original document text.
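A hedged sketch of that workflow; `sentence_span` and `run_model_on_text()` are hypothetical, and the only requirement is that the model's output spans use offsets into the target region's text:

```python
import text_extensions_for_pandas as tp

# `sentence_span` (hypothetical) is a Span over the original document marking
# one target region. Run the model over just that region's text; the resulting
# DataFrame's spans have offsets relative to the region text, not the document.
region_results = run_model_on_text(sentence_span.covered_text)  # hypothetical model call

# Rebase every span column so offsets refer to the original document text
doc_results = tp.spanner.unpack_semijoin(sentence_span, region_results)
```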