Spanner Algebra¶
The spanner module of Text Extensions for Pandas provides span-specific operations for Pandas DataFrames, based on the Document Spanners formalism, also known as spanner algebra.
Spanner algebra is an extension of relational algebra with additional operations to cover NLP applications. See the paper [“Document Spanners: A Formal Approach to Information Extraction”](https://researcher.watson.ibm.com/researcher/files/us-fagin/jacm15.pdf) by Fagin et al. for more information.
- text_extensions_for_pandas.spanner.adjacent_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second', min_gap: int = 0, max_gap: int = 0)[source]¶
Compute the join of two series of spans, where a pair of spans is considered to match if they are adjacent to each other in the text.
- Parameters
first_series – Spans that appear earlier. dtype must be TokenSpanDtype.
second_series – Spans that come after. dtype must be TokenSpanDtype.
first_name – Name to give the column in the returned dataframe that is derived from first_series.
second_name – Column name for spans from second_series in the returned DataFrame.
min_gap – Minimum gap, in tokens, allowed between a matching pair of spans (inclusive).
max_gap – Maximum gap, in tokens, allowed between a matching pair of spans (inclusive).
- Returns
a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
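For example, the following sketch pairs honorifics with the capitalized word that immediately follows them. The sample text, regexes, and variable names are illustrative, and it assumes that make_tokens() accepts a text and a tokenizer and that extract_regex_tok() returns token spans compatible with adjacent_join().

```python
import regex
import text_extensions_for_pandas as tp

text = "Mr. Smith met Ms. Jones in Boston."
tokenizer = tp.io.spacy.simple_tokenizer()
tokens = tp.io.spacy.make_tokens(text, tokenizer)

# Single-token matches for honorifics and for capitalized words
titles = tp.spanner.extract_regex_tok(tokens, regex.compile(r"(Mr|Ms)\.?"))["match"]
names = tp.spanner.extract_regex_tok(tokens, regex.compile(r"[A-Z][a-z]+"))["match"]

# Keep pairs where a name begins immediately after a title (zero-token gap)
pairs = tp.spanner.adjacent_join(titles, names,
                                 first_name="title", second_name="name")
```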
- text_extensions_for_pandas.spanner.consolidate(df: pandas.core.frame.DataFrame, on: str, how: str = 'left_to_right') pandas.core.frame.DataFrame [source]¶
Eliminate overlap among the spans in one column of a pd.DataFrame.
- Parameters
df – DataFrame containing spans and other attributes
on – Name of column in df on which to perform consolidation
how – What policy to use to decide which spans are considered to overlap and which of an overlapping pair will remain after consolidation. Available policies: left_to_right: Walk through the spans from left to right, keeping the longest non-overlapping match at each position encountered.
- Returns
the rows of df that remain after applying the specified policy to the spans in the column specified by on.
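A minimal sketch of the common pattern of consolidating overlapping regex matches; the text, regex, and the assumed make_tokens() signature are illustrative:

```python
import regex
import text_extensions_for_pandas as tp

text = "The New York Stock Exchange opened early."
tokenizer = tp.io.spacy.simple_tokenizer()
tokens = tp.io.spacy.make_tokens(text, tokenizer)

# Capitalized phrases of one to three tokens; shorter matches overlap longer ones
matches = tp.spanner.extract_regex_tok(
    tokens, regex.compile(r"([A-Z][a-z]+\s*)+"), min_len=1, max_len=3)

# Keep the longest non-overlapping match at each position (left_to_right policy)
longest = tp.spanner.consolidate(matches, on="match")
```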
- text_extensions_for_pandas.spanner.contain_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')[source]¶
Compute the join of two series of spans, where a pair of spans is considered to match if the second span is contained within the first.
- Parameters
first_series – First set of spans to join, wrapped in a pd.Series
second_series – Second set of spans to join. For pairs that satisfy the join predicate, these spans are contained within the corresponding spans of the first set.
first_name – Name to give the column in the returned dataframe that is derived from first_series.
second_name – Column name for spans from second_series in the returned DataFrame.
- Returns
a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
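A small self-contained sketch that pairs sentence spans with the name spans they contain; the text and regexes are illustrative only:

```python
import re
import pandas as pd
import text_extensions_for_pandas as tp

text = "Alice met Bob. Carol stayed home."
sentences = pd.Series(tp.spanner.extract_regex(text, re.compile(r"[^.]+\.")))
names = pd.Series(tp.spanner.extract_regex(text, re.compile(r"[A-Z][a-z]+")))

# Pair each sentence span with the name spans it contains
pairs = tp.spanner.contain_join(sentences, names,
                                first_name="sentence", second_name="name")
```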
- text_extensions_for_pandas.spanner.create_dict(entries: Iterable[str], tokenizer: spacy.tokenizer.Tokenizer = None) pandas.core.frame.DataFrame [source]¶
Create a dictionary from a list of entries, where each entry is expressed as a single string.
Tokenizes and normalizes the dictionary entries.
- Parameters
entries – Iterable of strings, one string per dictionary entry.
tokenizer – Preconfigured tokenizer object for tokenizing dictionary entries. Must always tokenize the same way as the tokenizer used on the target text! If None, this method will use the tokenizer returned by text_extensions_for_pandas.io.spacy.simple_tokenizer().
- Returns
pd.DataFrame with the normalized, tokenized dictionary entries.
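For example (the entry strings are illustrative):

```python
import text_extensions_for_pandas as tp

# With tokenizer=None, the entries are tokenized with the default tokenizer
# from tp.io.spacy.simple_tokenizer(); match against text tokenized the same way.
locations_dict = tp.spanner.create_dict(["New York", "San Francisco", "Boston"])
```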
- text_extensions_for_pandas.spanner.extract_dict(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], dictionary: pandas.core.frame.DataFrame, output_col_name: str = 'match')[source]¶
Identify all matches of a dictionary on a sequence of tokens.
- Parameters
tokens – SpanArray of token information, optionally wrapped in a pd.Series. These tokens must come from the same tokenizer that tokenized the entries of dictionary. To tokenize with SpaCy, use text_extensions_for_pandas.io.spacy.make_tokens().
dictionary – The dictionary to match, encoded as a pd.DataFrame in the format returned by load_dict().
output_col_name – (optional) name of column of matching spans in the returned DataFrame
- Returns
a single-column DataFrame of token ID spans of dictionary matches
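A minimal sketch, using create_dict() so that the dictionary and the target text share one tokenizer; the sample text and entries are illustrative:

```python
import text_extensions_for_pandas as tp

text = "She flew from Boston to San Francisco."
tokenizer = tp.io.spacy.simple_tokenizer()

# The dictionary and the target text must be tokenized by the same tokenizer.
tokens = tp.io.spacy.make_tokens(text, tokenizer)
locations_dict = tp.spanner.create_dict(["Boston", "San Francisco"], tokenizer)

matches = tp.spanner.extract_dict(tokens, locations_dict)
# One row per dictionary hit, in a column named "match"
```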
- text_extensions_for_pandas.spanner.extract_regex(doc_text: str, compiled_regex: re.Pattern)[source]¶
Identify all non-overlapping matches of a regular expression, as returned by re.Pattern.finditer(), and return those locations as an array of spans.
- Parameters
doc_text – Text of the document; will be the target text of the returned spans.
compiled_regex – Regular expression to evaluate, compiled with either the re or the regex package.
- Returns
SpanArray containing a span for each match of the regex.
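For example (the text and pattern are illustrative):

```python
import re
import text_extensions_for_pandas as tp

text = "Contact info@example.com or sales@example.com for details."
emails = tp.spanner.extract_regex(text, re.compile(r"[\w.+-]+@[\w.-]+"))
# `emails` is a SpanArray with one span per non-overlapping match
```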
- text_extensions_for_pandas.spanner.extract_regex_tok(tokens: Union[text_extensions_for_pandas.array.span.SpanArray, pandas.core.series.Series], compiled_regex: regex.regex.compile, min_len=1, max_len=1, output_col_name: str = 'match')[source]¶
Identify all (possibly overlapping) matches of a regular expression that start and end on token boundaries.
- Parameters
tokens – SpanArray of token information, optionally wrapped in a pd.Series.
compiled_regex – Regular expression to evaluate.
min_len – Minimum match length in tokens
max_len – Maximum match length (inclusive) in tokens
output_col_name – (optional) name of column of matching spans in the returned DataFrame
- Returns
A single-column DataFrame containing a span for each match of the regex.
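A hedged sketch; the text and pattern are illustrative, the third-party regex package is used to match the signature above, and make_tokens() is assumed to accept a text and a tokenizer:

```python
import regex
import text_extensions_for_pandas as tp

text = "Revenue rose 12 percent to 45 million."
tokenizer = tp.io.spacy.simple_tokenizer()
tokens = tp.io.spacy.make_tokens(text, tokenizer)

# One- or two-token matches that start and end on token boundaries, e.g.
# "12", "12 percent", "45", "45 million"; note that matches may overlap.
amounts = tp.spanner.extract_regex_tok(
    tokens, regex.compile(r"\d+(\s+\w+)?"), min_len=1, max_len=2)
```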
- text_extensions_for_pandas.spanner.extract_split(doc_text: str, split_points: Union[Sequence[int], numpy.ndarray, text_extensions_for_pandas.array.span.SpanArray]) text_extensions_for_pandas.array.span.SpanArray [source]¶
Split a document into spans along a specified set of split points.
- Parameters
doc_text – Text of the document; will be the target text of the returned spans.
split_points – A series of offsets into doc_text, expressed as either:
- A sequence of integers (split at certain locations and return a set of splits that covers every character in the document), as a list or 1-d Numpy array
- A sequence of spans (split around the indicated locations, but discard the parts of the document that are within a split point)
- Returns
SpanArray that splits the document in the specified way.
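A minimal sketch of the span-valued form of split_points, using extract_regex() to find the separators; the text and pattern are illustrative:

```python
import re
import text_extensions_for_pandas as tp

text = "alpha, beta, gamma"

# Split around the separator spans; the separators themselves are discarded,
# leaving spans for "alpha", "beta", and "gamma".
separators = tp.spanner.extract_regex(text, re.compile(r",\s*"))
fields = tp.spanner.extract_split(text, separators)
```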
- text_extensions_for_pandas.spanner.lemmatize(spans: Union[pandas.core.series.Series, text_extensions_for_pandas.array.span.SpanArray, Iterable[text_extensions_for_pandas.array.span.Span]], token_features: pandas.core.frame.DataFrame, lemma_col_name: str = 'lemma', token_span_col_name: str = 'span') List[str] [source]¶
Convert spans to their normal form using lemma information in a token features table.
- Parameters
spans – Spans to be normalized. Each may represent zero or more tokens.
token_features – DataFrame of token metadata. Index must be aligned with the token indices in spans.
lemma_col_name – Optional custom name for the DataFrame column containing the lemmatized form of each token.
token_span_col_name – Optional custom name for the DataFrame column containing the span of each token.
- Returns
A list containing normalized versions of the tokens in spans, with each token separated by a single space character.
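A hedged sketch; `token_features` and `entity_spans` are hypothetical inputs assumed to already exist (for example, a per-token DataFrame produced by a SpaCy-based featurizer and a set of entity spans over the same tokens):

```python
import text_extensions_for_pandas as tp

# `token_features`: one row per token, with a "span" column of token spans and
# a "lemma" column (hypothetical; e.g. produced by a SpaCy-based featurizer).
# `entity_spans`: the spans to normalize, each covering zero or more tokens.
normalized = tp.spanner.lemmatize(entity_spans, token_features)
# normalized[i] is the space-separated sequence of lemmas for entity_spans[i]
```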
- text_extensions_for_pandas.spanner.load_dict(file_name: str, tokenizer: spacy.tokenizer.Tokenizer = None)[source]¶
Load a SystemT-format dictionary file. File format is one entry per line.
Tokenizes and normalizes the dictionary entries.
- Parameters
file_name – Path to dictionary file
tokenizer – Preconfigured tokenizer object for tokenizing dictionary entries. Must be the same configuration as the tokenizer used on the target text! If None, this method will use SpaCy’s default English tokenizer.
- Returns
pd.DataFrame with the normalized, tokenized dictionary entries.
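For example (the file name is hypothetical):

```python
import text_extensions_for_pandas as tp

# "locations.dict" is a hypothetical SystemT-format file, one entry per line.
tokenizer = tp.io.spacy.simple_tokenizer()
locations_dict = tp.spanner.load_dict("locations.dict", tokenizer)
# Tokenize the target text with the same tokenizer before calling extract_dict().
```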
- text_extensions_for_pandas.spanner.overlap_join(first_series: pandas.core.series.Series, second_series: pandas.core.series.Series, first_name: str = 'first', second_name: str = 'second')[source]¶
Compute the join of two series of spans, where a pair of spans is considered to match if they overlap.
- Parameters
first_series – First set of spans to join, wrapped in a pd.Series
second_series – Second set of spans to join.
first_name – Name to give the column in the returned dataframe that is derived from first_series.
second_name – Column name for spans from second_series in the returned DataFrame.
- Returns
a new pd.DataFrame containing all pairs of spans that match the join predicate. Columns of the DataFrame will be named according to the first_name and second_name arguments.
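A small self-contained sketch; the text and the choice of fixed-width character windows are purely illustrative:

```python
import re
import pandas as pd
import text_extensions_for_pandas as tp

text = "The quick brown fox jumps."
words = pd.Series(tp.spanner.extract_regex(text, re.compile(r"\w+")))
# Fixed-width character windows that may cut across word boundaries
windows = pd.Series(tp.spanner.extract_regex(text, re.compile(r".{5}")))

# Pair each window with every word span it overlaps
pairs = tp.spanner.overlap_join(windows, words,
                                first_name="window", second_name="word")
```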
- text_extensions_for_pandas.spanner.unpack_semijoin(target_region: text_extensions_for_pandas.array.span.Span, model_results: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]¶
Unpack the results of evaluating an extraction model, such as dependency parsing or named entity recognition, using a semijoin strategy to reduce the amount of text over which the model is applied.
To use unpack_semijoin(), first identify the regions of the text over which you wish to run the model. Then run the model over the text of those regions to produce spans whose begin and end offsets are relative to the text of each distinct target region. Finally, pass each target region and its model results to this function to produce result spans whose begin and end offsets are relative to the original document text.
- Parameters
target_region – Span indicating a section of the original document text over which the model was applied.
model_results – Results from running your model over target_region, as a pd.DataFrame.
- Returns
A pd.DataFrame with the same schema as model_results, but with all spans converted from spans over the target text of target_region to spans over the original document text.
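A hedged sketch of that workflow; `sentence_span` and `run_model_on_text()` are hypothetical, and the only requirement is that the model's output spans use offsets into the target region's text:

```python
import text_extensions_for_pandas as tp

# `sentence_span` (hypothetical) is a Span over the original document marking
# one target region. Run the model over just that region's text; the resulting
# DataFrame's spans have offsets relative to the region text, not the document.
region_results = run_model_on_text(sentence_span.covered_text)  # hypothetical model call

# Rebase every span column so offsets refer to the original document text
doc_results = tp.spanner.unpack_semijoin(sentence_span, region_results)
```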