Input and Output

Text Extensions for Pandas includes functionality for converting the outputs of common NLP libraries into Pandas DataFrames. This section describes these I/O-related integrations.

In addition to the functionality described in this section, our extension types also support Pandas’ native serialization via Apache Arrow, including the to_feather and read_feather methods for binary file I/O.
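For example, a DataFrame with a span column can be round-tripped through a Feather file. The following is a minimal sketch; it assumes that SpanArray is exported at the top level of the text_extensions_for_pandas package and that it accepts a target string plus begin and end offsets, and that pyarrow is installed.

import pandas as pd
import text_extensions_for_pandas as tp

text = "Monty Python and the Holy Grail"
# Spans covering the tokens "Monty" and "Python"
df = pd.DataFrame({"span": tp.SpanArray(text, [0, 6], [5, 12])})

df.to_feather("spans.feather")           # Arrow-based binary serialization
df2 = pd.read_feather("spans.feather")   # the span extension type is restored on read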

IBM Watson Natural Language Understanding

This module of Text Extensions for Pandas includes I/O functions related to the Watson Natural Language Understanding service on the IBM Cloud. This service provides analysis of text features through a request/response API. See https://cloud.ibm.com/docs/natural-language-understanding?topic=natural-language-understanding-getting-started for information on getting started with the service. Details of the API and available features can be found at https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#introduction. For convenience, a Python SDK is available at https://github.com/watson-developer-cloud/python-sdk that can be used to authenticate and make requests to the service.

text_extensions_for_pandas.io.watson.nlu.make_span_from_entities(char_span: text_extensions_for_pandas.array.span.SpanArray, entities_frame: pandas.core.frame.DataFrame, entity_col: str = 'text') text_extensions_for_pandas.array.token_span.TokenSpanArray[source]

Create a token span array for the entity text, given the entities DataFrame and an existing char span array containing the tokens of the entire analyzed text.

Parameters
  • char_span – Parsed tokens

  • entities_frame – Entities DataFrame from parse_response

  • entity_col – Column name for the entity text

Returns

TokenSpanArray for matching entities
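A usage sketch, assuming response is a Watson NLU response dictionary like the one built in the parse_response() example below, with both the entities and syntax features requested:

from text_extensions_for_pandas.io.watson import nlu

dfs = nlu.parse_response(response)
entity_spans = nlu.make_span_from_entities(
    char_span=dfs["syntax"]["span"].array,   # parsed tokens as a SpanArray
    entities_frame=dfs["entities"],          # entities DataFrame from parse_response
)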

text_extensions_for_pandas.io.watson.nlu.parse_response(response: Dict[str, Any], original_text: Optional[str] = None, apply_standard_schema: bool = False) Dict[str, pandas.core.frame.DataFrame][source]

Parse a Watson NLU response as a decoded JSON string, i.e. a dictionary containing the requested features, and convert it into a dict of Pandas DataFrames. The following features in the response will be converted:

  • entities

  • entity_mentions (elements of the “mentions” field of response[“entities”])

  • keywords

  • relations

  • semantic_roles

  • syntax

For information on getting started with Watson Natural Language Understanding on IBM Cloud, see https://cloud.ibm.com/docs/natural-language-understanding?topic=natural-language-understanding-getting-started. A Python SDK for authentication and making requests to the service is provided at https://github.com/watson-developer-cloud/python-sdk. Details on the supported features and available options when making the request can be found at https://cloud.ibm.com/apidocs/natural-language-understanding?code=python#analyze-text.

Note

Additional feature data in the response will not be processed.

>>> response = natural_language_understanding.analyze(
...     url="https://raw.githubusercontent.com/CODAIT/text-extensions-for-pandas/master/resources/holy_grail.txt",
...         return_analyzed_text=True,
...         features=Features(
...         entities=EntitiesOptions(sentiment=True),
...         keywords=KeywordsOptions(sentiment=True, emotion=True),
...         relations=RelationsOptions(),
...         semantic_roles=SemanticRolesOptions(),
...         syntax=SyntaxOptions(sentences=True, tokens=SyntaxOptionsTokens(lemma=True, part_of_speech=True))
...     )).get_result()
>>> dfs = parse_response(response)
>>> dfs.keys()
dict_keys(['syntax', 'entities', 'keywords', 'relations', 'semantic_roles'])
>>> dfs["syntax"].head()
                span part_of_speech   lemma  \
0    [0, 5): 'Monty'          PROPN    None
1  [6, 12): 'Python'          PROPN  python

                                            sentence
0  [0, 273): 'Monty Python and the Holy Grail is ...
1  [0, 273): 'Monty Python and the Holy Grail is ...
Parameters
  • response – A dictionary of features from the IBM Watson NLU response

  • original_text – Optional original text sent in the request; if None, the function will look for the “analyzed_text” field in the response

  • apply_standard_schema – If True, return DataFrames with a fixed schema, whether or not the corresponding data was present in the response

Returns

A dictionary mapping feature name to a Pandas DataFrame

IBM Watson Discovery Table Understanding

This module of Text Extensions for Pandas includes I/O functions related to the Table Understanding capabilities of Watson Discovery.

Table Understanding is available as part of the Watson Discovery component for IBM Cloud Pak for Data.

Table Understanding is also available in Watson Compare and Comply table extraction on the IBM Cloud. Details of the Compare and Comply API and available features can be found at https://cloud.ibm.com/apidocs/compare-comply?code=python#extract-a-document-s-tables. For convenience, a Python SDK is available at https://github.com/watson-developer-cloud/python-sdk that can be used to authenticate and make requests to the service.

text_extensions_for_pandas.io.watson.tables.convert_cols_to_numeric(df_in: pandas.core.frame.DataFrame, columns=None, rows=None, decimal_pt='.', cast_type=<class 'float'>) pandas.core.frame.DataFrame[source]

Converts the specified columns or rows to a numeric type.

If no columns or rows are given, all elements are converted to numeric types.

Values are converted to the specified cast_type, which defaults to float.

Parameters
  • df_in – DataFrame (table) to convert

  • columns – columns to convert to numeric type, as a list of strings

  • rows – rows to convert to numeric type, as a list of strings

  • decimal_pt – the symbol used as the decimal point (typically “.” or “,”)

  • cast_type – type to cast the object to, as a class. Defaults to float

Returns

the converted table.
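A small self-contained sketch with made-up data standing in for a reconstructed table, showing the documented parameters:

import pandas as pd
from text_extensions_for_pandas.io.watson import tables

df = pd.DataFrame({
    "Revenue": ["1234,5", "2000,0"],   # numbers written with "," as the decimal point
    "Region": ["EMEA", "APAC"],
})
numeric_df = tables.convert_cols_to_numeric(
    df, columns=["Revenue"], decimal_pt=",", cast_type=float
)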

text_extensions_for_pandas.io.watson.tables.get_raw_html(doc_response, parsed_table)[source]
Parameters
  • doc_response – JSON response from the Watson Table Understanding enrichment

  • parsed_table – Table pulled out of the response by functions in this module

Returns

Raw HTML for the table’s original document markup

text_extensions_for_pandas.io.watson.tables.make_exploded_df(dfs_dict: Dict[str, pandas.core.frame.DataFrame], drop_original: bool = True, row_explode_by: Optional[str] = None, col_explode_by: Optional[str] = None, keep_all_cols: bool = False) Tuple[pandas.core.frame.DataFrame, list, list][source]

Creates a value-attribute mapping, mapping cell values to header or row-number values. This is a preliminary stage in creating the final table, but may be a useful intermediate result on its own.

Parameters
  • dfs_dict – The dictionary of {features : DataFrames} returned by watson_tables_parse_response

  • drop_original – drop the original column location information. defaults to True

  • row_explode_by – If specified, sets the method used to explode rows instead of the default logic: “title” arranges rows by the title field; “title_id” arranges rows by the title_id field; “index” arranges rows by the row / column locations given

  • col_explode_by – If specified, sets the method used to explode columns instead of the default logic: “title” arranges columns by the title field; “title_id” arranges columns by the title_id field; “index” arranges columns by the row / column locations given

  • keep_all_cols – If False, keep only the attributes necessary for constructing the final table. This gets overridden if drop_original is False.

Returns

a table mapping values to attributes (either headings or row numbers if no headings exist)
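A usage sketch, assuming response is a decoded table-extraction response as described under parse_response() below:

from text_extensions_for_pandas.io.watson import tables

dfs_dict = tables.parse_response(response)
exploded_df, row_heading_cols, column_heading_cols = tables.make_exploded_df(dfs_dict)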

text_extensions_for_pandas.io.watson.tables.make_table(dfs_dict: Dict[str, pandas.core.frame.DataFrame], value_col='text', row_explode_by: Optional[str] = None, col_explode_by: Optional[str] = None, concat_with: str = ' | ', convert_numeric_items: bool = True, sort_headers: bool = True, prevent_id_explode: bool = False)[source]

Runs the end-to-end process of creating the table, starting with the parsed response from the Compare & Comply or Watson Discovery engine, and returns the completed table.

Parameters
  • dfs_dict – The dictionary of {features : DataFrames} returned by watson_tables_parse_response

  • value_col – Which column to use for cell values; “text” by default

  • row_explode_by – If specified, sets the method used to explode rows instead of the default logic: “title” arranges rows by the title field; “title_id” arranges rows by the title_id field; “index” arranges rows by the row / column locations given

  • col_explode_by – If specified, sets the method used to explode columns instead of the default logic: “title” arranges columns by the title field; “title_id” arranges columns by the title_id field; “index” arranges columns by the row / column locations given

  • concat_with – the delimiter to use when concatenating duplicate entries. Using an empty string (“”) will fuse entries

  • convert_numeric_items – if True, auto-detect and convert numeric rows and columns to numeric datatypes

  • sort_headers – If True, the headers will be sorted into their original ordering from the table. This will be a little slower. Note: sorting headers is still experimental on multi-index tables where not all headers have the same number of elements

  • prevent_id_explode – If True, prevents the default behaviour of exploding by index, which creates higher-fidelity versions of the parsed output but may produce more complex and less idiomatic tables. This does not affect behaviour when either row_explode_by or col_explode_by is set to “title_id”

Returns

The reconstructed table; it should be a 1:1 translation of the original table
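An end-to-end sketch, again assuming response is a decoded table-extraction response as described under parse_response() below:

from text_extensions_for_pandas.io.watson import tables

dfs_dict = tables.parse_response(response)
table = tables.make_table(dfs_dict, concat_with=" | ", convert_numeric_items=True)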

text_extensions_for_pandas.io.watson.tables.make_table_from_exploded_df(exploded_df: pandas.core.frame.DataFrame, row_heading_cols, column_heading_cols, dfs_dict=None, value_col: str = 'text', concat_with: str = ' | ', convert_numeric_items=False, sort_headers=True) pandas.core.frame.DataFrame[source]

Takes the exploded DataFrame and converts it into the reconstructed table.

Parameters
  • exploded_df – The exploded dataframe, as returned by make_exploded_df

  • row_heading_cols – the names of the columns referring to row headings (as output by make_exploded_df())

  • column_heading_cols – the names of the columns referring to column headings (as output by make_exploded_df())

  • value_col – the name of the column to use for the value of each cell. Defaults to ‘text’

  • concat_with – the delimiter to use when concatenating duplicate entries. Using an empty string (“”) will fuse entries

  • dfs_dict – Dictionary parsed from the initial step of table reconstruction. This is required to re-order columns into their original format; if not provided, the reordering will not take place

  • convert_numeric_items – if True, rows or columns with numeric items will be detected and converted to floats or ints

  • sort_headers – If True, the headers will be sorted into their original ordering from the table. This will be a little slower. Note: sorting headers is still experimental on multi-index tables where not all headers have the same number of elements

Returns

The reconstructed table; it should be a 1:1 translation of the original table, but both machine and human readable

text_extensions_for_pandas.io.watson.tables.parse_response(response: Dict[str, Any], select_table=None) Dict[str, pandas.core.frame.DataFrame][source]

Parse a response from Watson Table Understanding as a decoded JSON string, i.e. a dictionary containing the requested features, and convert it into a dict of Pandas DataFrames.

The following features will be converted from the response:

  • Row headers

  • Column headers

  • Body cells

For more information on using Watson Table Extraction or the Compare and Comply API, see https://cloud.ibm.com/docs/compare-comply?topic=compare-comply-understanding_tables. More information about the available features can be found at https://cloud.ibm.com/apidocs/compare-comply?code=python#extract-a-document-s-tables.

Parameters
  • response – A dictionary of features returned by the IBM Watson Compare and Comply web service API or a comparable Watson Discovery API

  • select_table – Defaults to analyzing the first table; pass a number here to analyze the nth table instead

Returns

A dictionary mapping feature names (“row_headers”, “col_headers”, “body_cells”) to Pandas DataFrames

text_extensions_for_pandas.io.watson.tables.substitute_text_names(table_in, dfs_dict, sub_rows: bool = True, sub_cols: bool = True)[source]
Parameters
  • table_in – Table to operate on

  • dfs_dict – Parsed representation from Watson response

  • sub_rows – Whether or not to attempt to substitute row headers

  • sub_cols – Whether or not to attempt to substitute column headers

Returns

The original table, but with any row and column headers that were title IDs replaced by the plain-text headers they correspond to

CoNLL-2003 and CoNLL-U File Formats

The io.conll module contains I/O functions related to the CoNLL-2003 file format and its derivatives, including CoNLL-U.

text_extensions_for_pandas.io.conll.add_token_classes(token_features: pandas.core.frame.DataFrame, token_class_dtype: Optional[pandas.core.dtypes.dtypes.CategoricalDtype] = None, iob_col_name: str = 'ent_iob', entity_type_col_name: str = 'ent_type') pandas.core.frame.DataFrame[source]

Add additional columns to a dataframe of IOB-tagged tokens containing composite string and integer category labels for the tokens.

Parameters
  • token_features – Dataframe of tokens with IOB tags and entity type strings

  • token_class_dtype – Optional Pandas categorical dtype indicating how to map composite tags like I-PER to integer values. You can use make_iob_tag_categories() to generate this dtype. If this parameter is not provided, this function will use an arbitrary mapping using the values that appear in this dataframe.

  • iob_col_name – Optional name of a column in token_features that contains the IOB2 tags as strings, “I”, “O”, or “B”.

  • entity_type_col_name – Optional name of a column in token_features that contains entity type information; or None if no such column exists.

Returns

A copy of token_features with two additional columns, token_class (string class label) and token_class_id (integer label). If token_features contains columns with either of these names, those columns will be overwritten in the returned copy of token_features.
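A usage sketch, assuming token_features is a DataFrame of IOB-tagged tokens (for example, one element of the list returned by conll_2003_to_dataframes()):

from text_extensions_for_pandas.io import conll

token_class_dtype, int_to_label, label_to_int = conll.make_iob_tag_categories(
    ["PER", "ORG", "LOC", "MISC"]
)
labeled_df = conll.add_token_classes(token_features, token_class_dtype=token_class_dtype)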

text_extensions_for_pandas.io.conll.combine_folds(fold_to_docs: Dict[str, List[pandas.core.frame.DataFrame]])[source]

Merge together multiple parts of a corpus (i.e. train, test, validation) into a single DataFrame of all tokens in the corpus.

Parameters

fold_to_docs – Mapping from fold name (“train”, “test”, etc.) to list of per-document DataFrames as produced by util.conll_to_bert(). All DataFrames must have the same schema, but any schema is ok.

Returns

A corpus-wide DataFrame with additional leading columns fold and doc_num that indicate which fold and which document number within the fold each row of the DataFrame comes from.

text_extensions_for_pandas.io.conll.compute_accuracy_by_document(corpus_dfs: Dict[Tuple[str, int], pandas.core.frame.DataFrame], output_dfs: Dict[Tuple[str, int], pandas.core.frame.DataFrame]) pandas.core.frame.DataFrame[source]

Compute precision, recall, and F1 scores by document.

Parameters
  • corpus_dfs

    Gold-standard span/entity type pairs, as either:

    • a dictionary of DataFrames, one DataFrame per document, indexed by tuples of (collection name, offset into collection)

    • a list of DataFrames, one per document as returned by conll_2003_output_to_dataframes()

  • output_dfs – Model outputs, in the same format as corpus_dfs (i.e. exactly the same column names).

text_extensions_for_pandas.io.conll.compute_global_accuracy(stats_by_doc: pandas.core.frame.DataFrame)[source]

Compute collection-wide precision, recall, and F1 score from the output of compute_accuracy_by_document().

Parameters

stats_by_doc – Output of compute_accuracy_by_document()

Returns

A Python dictionary of collection-level statistics about result quality.
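A sketch of the two-step evaluation flow, assuming corpus_dfs and output_dfs are dictionaries of gold-standard and model-output DataFrames as described above:

from text_extensions_for_pandas.io import conll

stats_by_doc = conll.compute_accuracy_by_document(corpus_dfs, output_dfs)
global_stats = conll.compute_global_accuracy(stats_by_doc)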

text_extensions_for_pandas.io.conll.conll_2003_output_to_dataframes(doc_dfs: List[pandas.core.frame.DataFrame], input_file: str, column_name: str = 'ent', copy_tokens: bool = False) List[pandas.core.frame.DataFrame][source]

Parse a file in CoNLL-2003 output format into a DataFrame.

CoNLL-2003 output format looks like this:

O
O
I-LOC
O
O

I-PER
I-PER

Note the lack of any information about the tokens themselves. Note also the lack of any information about document boundaries.

Parameters
  • doc_dfs – List of pd.DataFrames of token information, as returned by conll_2003_to_dataframes(). This is needed because CoNLL-2003 output format does not include any information about document boundaries.

  • input_file – Location of input file to read.

  • column_name – Name for the metadata value that the IOB-tagged data in input_file encodes. If this name is present in doc_dfs, its value will be replaced with the data from input_file; otherwise a new column will be added to each dataframe.

  • copy_tokens – If True, deep-copy token series from the elements of doc_dfs instead of using pointers.

Returns

A list containing, for each document in the input file, a separate pd.DataFrame of four columns:

  • span: Span of each token, with character offsets. Backed by the concatenation of the tokens in the document into a single string with one sentence per line.

  • token_span: Span of each token, with token offsets. Backed by the contents of the span column.

  • <column_name>_iob: IOB2-format tags of tokens, corrected so that every entity begins with a “B” tag.

  • <column_name>_type: Entity type names for tokens tagged “I” or “B” in the <column_name>_iob column; None everywhere else.

text_extensions_for_pandas.io.conll.conll_2003_to_dataframes(input_file: str, column_names: List[str], iob_columns: List[bool], space_before_punct: bool = False) List[pandas.core.frame.DataFrame][source]

Parse a file in CoNLL-2003 training/test format into a DataFrame.

CoNLL-2003 training/test format looks like this:

-DOCSTART- -X- -X- O

CRICKET NNP I-NP O
- : O O
LEICESTERSHIRE NNP I-NP I-ORG
TAKE NNP I-NP O
OVER IN I-PP O
AT NNP I-NP O

Note the presence of the surface forms of tokens at the beginning of the lines.

Parameters
  • input_file – Location of input file to read.

  • space_before_punct – If True, add whitespace before punctuation characters when reconstructing the text of the document.

  • column_names – Names for the metadata columns that come after the token text. These names will be used to generate the column names of the dataframes that this function returns.

  • iob_columns – Mask indicating which of the metadata columns after the token text should be treated as being in IOB format. If a column is in IOB format, the returned dataframe will contain two columns, holding IOB2 tags and entity type tags, respectively. For example, an input column “ent” will turn into output columns “ent_iob” and “ent_type”.

Returns

A list containing, for each document in the input file, a separate pd.DataFrame of four columns:

  • span: Span of each token, with character offsets. Backed by the concatenation of the tokens in the document into a single string with one sentence per line.

  • ent_iob: IOB2-format tags of tokens, corrected so that every entity begins with a “B” tag.

  • ent_type: Entity type names for tokens tagged “I” or “B” in the ent_iob column; None everywhere else.

text_extensions_for_pandas.io.conll.conll_u_to_dataframes(input_file: str, column_names: Optional[List[str]] = None, iob_columns: Optional[List[bool]] = None, has_predicate_args: bool = True, space_before_punct: bool = False, merge_subtokens: bool = False, merge_subtoken_separator: str = '|', numeric_cols: Optional[List[str]] = None, metadata_fields: Optional[Dict[str, str]] = None, separate_sentences_by_doc: bool = False) List[pandas.core.frame.DataFrame][source]

Parse a file in CoNLL-U format into DataFrames.

Parameters
  • input_file – Location of input file to read.

  • space_before_punct – If True, add whitespace before punctuation characters when reconstructing the text of the document.

  • column_names – Optional. Names for the metadata columns that come after the token text. These names will be used to generate the column names of the dataframes that this function returns. If no value is provided, these default to the list returned by default_conll_u_field_names(), which is also the format defined at https://universaldependencies.org/docs/format.html.

  • iob_columns – Mask indicating which of the metadata columns after the token text should be treated as being in IOB format. If a column is in IOB format, the returned dataframe will contain two columns, holding IOB2 tags and entity type tags, respectively. For example, an input column “ent” will turn into output columns “ent_iob” and “ent_type”. By default (for CoNLL-U and EWT formats), this is all False.

  • has_predicate_args – Whether or not the file format includes predicate arguments. True by default; this should support most files in the CoNLL-U format, but any tabs in the last element will be assumed to be additional predicate arguments

  • merge_subtokens – Dictates how to handle tokens that are smaller than one word. By default, the subtokens are kept as separate entities, but if this is set to True, the subtokens will be merged into a single entity of the same length as the token, and their attributes will be concatenated

  • merge_subtoken_separator – If merge_subtokens is selected, concatenate the attributes with this separator; ‘|’ by default

  • numeric_cols – Optional: Names of numeric columns drawn from column_names, plus the default “built-in” column name line_num. Any column whose name is in this list will be considered to hold numeric values. Column names not present in the column_names argument will be ignored. If no value is provided, then the return value of default_conll_u_numeric_cols() will be used.

  • metadata_fields – Optional. Types of metadata fields you want to store from the document, in the form of a dictionary: tag_in_text -> “pretty” tag (i.e. what you want to show in the output). If no value is provided, then the return value of default_ewt_metadata() will be used.

  • separate_sentences_by_doc – Optional. If False (the default behavior), use the document boundaries defined in the CoNLL-U file. If True, then treat each sentence in the input file as a separate document.

Returns

A list containing, for each document in the input file, a separate pd.DataFrame of four columns:

  • span: Span of each token, with character offsets. Backed by the concatenation of the tokens in the document into a single string with one sentence per line.

  • ent_iob: IOB2-format tags of tokens, corrected so that every entity begins with a “B” tag.

  • ent_type: Entity type names for tokens tagged “I” or “B” in the ent_iob column; None everywhere else.
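A minimal sketch; "en_ewt-ud-train.conllu" is a hypothetical local path to a Universal Dependencies EWT file, and all optional arguments are left at their defaults:

from text_extensions_for_pandas.io import conll

ewt_dfs = conll.conll_u_to_dataframes("en_ewt-ud-train.conllu")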

text_extensions_for_pandas.io.conll.decode_class_labels(class_labels: Iterable[str])[source]

Decode the composite labels that add_token_classes() creates.

Parameters

class_labels – Iterable of string class labels like “I-LOC”

Returns

A tuple of (IOB2 tags, entity type strings) corresponding to the class labels.

text_extensions_for_pandas.io.conll.default_conll_u_field_names() List[str][source]
Returns

The default set of field names (not including the required first two fields) to use when parsing CoNLL-U files.

text_extensions_for_pandas.io.conll.default_conll_u_numeric_cols() List[str][source]
text_extensions_for_pandas.io.conll.default_ewt_metadata() Dict[str, str][source]
Returns

What metadata to log from CoNLL-U (especially EWT) files. This is a dict of the form tag_in_file -> desired name. When the tag is seen in a comment in the file, its value will be stored and assumed to apply to all elements in that document.

text_extensions_for_pandas.io.conll.iob_to_spans(token_features: pandas.core.frame.DataFrame, iob_col_name: str = 'ent_iob', span_col_name: str = 'span', entity_type_col_name: str = 'ent_type')[source]

Convert token tags in Inside–Outside–Beginning (IOB2) format to a series of TokenSpan objects of entities. See Wikipedia for more information on the IOB2 format.

Parameters
  • token_features – DataFrame of token features in the format returned by make_tokens_and_features().

  • iob_col_name – Name of a column in token_features that contains the IOB2 tags as strings, “I”, “O”, or “B”.

  • span_col_name – Name of a column in token_features that contains the tokens as a SpanArray.

  • entity_type_col_name – Optional name of a column in token_features that contains entity type information; or None if no such column exists.

Returns

A pd.DataFrame with the following columns:

  • span: Span (with token offsets) of each entity

  • <value of entity_type_col_name>: (optional) Entity type
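A usage sketch, assuming token_features is a token-features DataFrame with span, ent_iob, and ent_type columns (for example, the output of make_tokens_and_features() or one document from conll_2003_to_dataframes()):

from text_extensions_for_pandas.io import conll

entity_spans_df = conll.iob_to_spans(token_features)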

text_extensions_for_pandas.io.conll.make_iob_tag_categories(entity_types: List[str]) Tuple[pandas.core.dtypes.dtypes.CategoricalDtype, List[str], Dict[str, int]][source]

Enumerate all the possible token categories for combinations of IOB tags and entity types (for example, I + "PER" ==> "I-PER"). Generate a consistent mapping from these strings to integers.

Parameters

entity_types – Allowable entity type strings for the corpus

Returns

A triple of:

  • Pandas CategoricalDtype

  • mapping from integer to string label, as a list. This mapping is guaranteed to be consistent with the mapping in the Pandas CategoricalDtype in the first return value.

  • mapping string label to integer, as a dict; the inverse of the second return value.

text_extensions_for_pandas.io.conll.maybe_download_conll_data(target_dir: str) Dict[str, str][source]

Download and cache a copy of the CoNLL-2003 named entity recognition data set.

NOTE: This data set is licensed for research use only. Be sure to adhere to the terms of the license when using this data set!

Parameters

target_dir – Directory where this function should write the corpus files, if they are not already present.

Returns

Dictionary containing a mapping from fold name to file name for each of the three folds (train, test, dev) of the corpus.
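A sketch that downloads (or reuses a cached copy of) the corpus and then parses the training fold with conll_2003_to_dataframes(). The column names here are arbitrary labels for the POS, chunk, and entity columns of the CoNLL-2003 files; the chunk and entity columns are IOB-tagged.

from text_extensions_for_pandas.io import conll

files_by_fold = conll.maybe_download_conll_data("outputs")
train_dfs = conll.conll_2003_to_dataframes(
    files_by_fold["train"],
    column_names=["pos", "phrase", "ent"],
    iob_columns=[False, True, True],
)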

text_extensions_for_pandas.io.conll.maybe_download_dataset_data(target_dir: str, document_url: str, fname: Optional[str] = None) Union[str, List[str]][source]

If the file at the given URL is not already present in the target directory, download it and save it there. Returns the path to the file. If a zip archive is downloaded, only files that are not already in the target directory will be extracted, and if fname is given, only that file will be operated on. Note that if a zip archive is downloaded, it will be unpacked, so verify that the URL being used is safe.

Parameters
  • target_dir – Directory where this function should write the document

  • document_url – URL from which to download the document. If fname is not specified, the string after the last slash is assumed to be the name of the file.

  • fname – If given, the name of the file that is checked for in the target directory, as well as the name used to save the file if no such file is found. If a zip file is downloaded and a file of this name exists in the archive, only that file will be extracted.

Returns

The path to the file, or None if downloading was not successful.

text_extensions_for_pandas.io.conll.spans_to_iob(token_spans: Union[text_extensions_for_pandas.array.token_span.TokenSpanArray, List[text_extensions_for_pandas.array.token_span.TokenSpan], pandas.core.series.Series], span_ent_types: Optional[Union[str, Iterable, numpy.ndarray, pandas.core.series.Series]] = None) pandas.core.frame.DataFrame[source]

Convert a series of TokenSpan objects of entities to token tags in Inside–Outside–Beginning (IOB2) format. See Wikipedia for more information on the IOB2 format.

Parameters
  • token_spans – An object that can be converted to a TokenSpanArray via TokenSpanArray.make_array(). Should contain TokenSpan objects aligned with the target tokenization. All spans must be from the same document. Usually you create this array by calling TokenSpanArray.align_to_tokens().

  • span_ent_types – List of entity type strings corresponding to each of the elements of token_spans, or None to indicate null entity tags.

Returns

A pd.DataFrame with two columns:

  • “ent_iob”: IOB2 tags of the tokens, as strings

  • “ent_type”: Entity type strings (or NaN values if span_ent_types is None)
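A usage sketch, assuming entity_spans is a TokenSpanArray aligned to the target tokenization (for example, the output of TokenSpanArray.align_to_tokens()) and ent_types holds one entity type string per span:

from text_extensions_for_pandas.io import conll

iob_df = conll.spans_to_iob(entity_spans, span_ent_types=ent_types)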

Pandas APIs for SpaCy Data Structures

The io.spacy module contains I/O functions related to the SpaCy NLP library.

text_extensions_for_pandas.io.spacy.make_tokens(target_text: str, tokenizer: spacy.tokenizer.Tokenizer = None) pandas.core.series.Series[source]
Parameters
  • target_text – Text to tokenize

  • tokenizer – Preconfigured spacy.tokenizer.Tokenizer object, or None to use the tokenizer returned by simple_tokenizer()

Returns

The tokens (and underlying text) as a Pandas Series wrapped around a SpanArray value.
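A minimal sketch using the default tokenizer (requires spaCy to be installed):

from text_extensions_for_pandas.io import spacy as tp_spacy

tokens = tp_spacy.make_tokens("Monty Python and the Holy Grail")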

text_extensions_for_pandas.io.spacy.make_tokens_and_features(target_text: str, language_model, add_left_and_right=False) pandas.core.frame.DataFrame[source]
Parameters
  • target_text – Text to analyze

  • language_model – Preconfigured spaCy language model (spacy.language.Language) object

  • add_left_and_right – If True, add columns “left” and “right” containing references to previous and next tokens.

Returns

A tuple of two dataframes:

  1. The tokens of the text plus additional linguistic features that the language model generates, represented as a pd.DataFrame.

  2. A table of named entities identified by the language model’s named entity tagger, represented as a pd.DataFrame.
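A usage sketch, assuming a small English spaCy model such as en_core_web_sm is installed:

import spacy
from text_extensions_for_pandas.io import spacy as tp_spacy

nlp = spacy.load("en_core_web_sm")
token_features = tp_spacy.make_tokens_and_features(
    "Monty Python and the Holy Grail", nlp, add_left_and_right=True
)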

text_extensions_for_pandas.io.spacy.render_parse_tree(token_features: pandas.core.frame.DataFrame, text_col: str = 'span', tag_col: str = 'tag', label_col: str = 'dep', head_col: str = 'head') None[source]

Display a DataFrame in the format returned by make_tokens_and_features() using displaCy’s dependency tree renderer.

See https://spacy.io/usage/visualizers for more information on displaCy.

Parameters
  • token_features – A subset of a token features DataFrame in the format returned by make_tokens_and_features(). Must at a minimum contain the head column and an integer index that corresponds to the ints in the head column.

  • text_col – Name of the column in token_features from which the ‘covered text’ label for each node of the parse tree should be extracted, or None to leave those labels blank.

  • tag_col – Name of the column in token_features from which the ‘tag’ label for each node of the parse tree should be extracted; or None to leave those labels blank.

  • label_col – Name of the column in token_features from which the label for each edge of the parse tree should be extracted; or None to leave those labels blank.

  • head_col – Name of the column in token_features from which the head node of each parse tree node should be extracted.

text_extensions_for_pandas.io.spacy.simple_tokenizer() spacy.tokenizer.Tokenizer[source]
Returns

Singleton instance of a SpaCy tokenizer that splits text on all whitespace and all punctuation characters.

This type of tokenization is recommended for dictionary and regular expression matching.

text_extensions_for_pandas.io.spacy.token_features_to_tree(token_features: pandas.core.frame.DataFrame, text_col: str = 'span', tag_col: str = 'tag', label_col: str = 'dep', head_col: str = 'head')[source]

Convert a DataFrame in the format returned by make_tokens_and_features() to the public input format of displaCy’s dependency tree renderer.

Parameters
  • token_features – A subset of a token features DataFrame in the format returned by make_tokens_and_features(). Must at a minimum contain the head column and an integer index that corresponds to the ints in the head column.

  • text_col – Name of the column in token_features from which the ‘covered text’ label for each node of the parse tree should be extracted, or None to leave those labels blank.

  • tag_col – Name of the column in token_features from which the ‘tag’ label for each node of the parse tree should be extracted; or None to leave those labels blank.

  • label_col – Name of the column in token_features from which the label for each edge of the parse tree should be extracted; or None to leave those labels blank.

  • head_col – Name of the column in token_features from which the head node of each parse tree node should be extracted.

Returns

Native Python type representation of the parse tree, in a format suitable to pass to displacy.render(manual=True ...). See https://spacy.io/usage/visualizers for the specification of this format.

Support for BERT and similar language models

The io.bert module contains functions for working with transformer-based language models such as BERT, including managing the special tokenization and windowing that these models require.

This module uses the transformers library to implement tokenization and embedding generation. You will need that library in your Python path to use the functions in this module.

text_extensions_for_pandas.io.bert.add_embeddings(df: pandas.core.frame.DataFrame, bert: Any, overlap: int = 32, non_overlap: int = 64) pandas.core.frame.DataFrame[source]

Add BERT embeddings to a DataFrame of BERT tokens.

Parameters
  • df – DataFrame containing BERT tokens, as returned by make_bert_tokens(). Must contain a column input_id containing token IDs.

  • bert – PyTorch-based BERT model from the transformers library

  • overlap – (optional) how much overlap there should be between adjacent windows

  • non_overlap – (optional) how much non-overlapping content there should be in the middle of each window, between the overlapping regions

Returns

A copy of df with a new column, “embedding”, containing BERT embeddings as a TensorArray.

Note

PyTorch must be installed to run this function.
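A sketch of generating embeddings for a short document, assuming the transformers and torch packages are installed; "bert-base-uncased" is just an example checkpoint, and make_bert_tokens() is documented below.

from transformers import BertModel, BertTokenizerFast
from text_extensions_for_pandas.io import bert as tp_bert

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")

token_df = tp_bert.make_bert_tokens("Monty Python and the Holy Grail", tokenizer)
token_df = tp_bert.add_embeddings(token_df, bert_model)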

text_extensions_for_pandas.io.bert.align_bert_tokens_to_corpus_tokens(spans_df: pandas.core.frame.DataFrame, corpus_toks_df: pandas.core.frame.DataFrame, spans_df_token_col: str = 'span', corpus_df_token_col: str = 'span', entity_type_col: str = 'ent_type') pandas.core.frame.DataFrame[source]

Expand entity matches from a BERT-based model so that they align with the corpus’s original tokenization.

Parameters
  • spans_df – DataFrame of extracted entities. Must contain two columns with span and entity type information, respectively. Other columns ignored.

  • corpus_toks_df – DataFrame of the corpus’s original tokenization, one row per token. Must contain a column with character-based spans of the tokens.

  • spans_df_token_col – the name of the column in spans_df containing its tokenization. By default, 'span'

  • corpus_df_token_col – the name of the column in corpus_toks_df that contains its tokenization. By default, 'span'

  • entity_type_col – the name of the column in spans_df that contains the entity types of the elements

Returns

A new DataFrame with schema ["span", "ent_type"], where the “span” column contains token-based spans based off the corpus tokenization in corpus_toks_df["span"].

text_extensions_for_pandas.io.bert.conll_to_bert(df: pandas.core.frame.DataFrame, tokenizer: Any, bert: Any, token_class_dtype: pandas.core.dtypes.dtypes.CategoricalDtype, compute_embeddings: bool = True, overlap: int = 32, non_overlap: int = 64) pandas.core.frame.DataFrame[source]
Parameters
  • df – One DataFrame from the conll_2003_to_dataframes() function, representing the tokens of a single document in the original tokenization.

  • tokenizer – BERT tokenizer instance from the transformers library

  • bert – PyTorch-based BERT model from the transformers library

  • token_class_dtype – Pandas categorical type for representing token class labels, as returned by make_iob_tag_categories()

  • compute_embeddings – True to generate BERT embeddings at each token position and add a column “embedding” to the returned DataFrame with the embeddings

  • overlap – (optional) how much overlap there should be between adjacent windows for embeddings

  • non_overlap – (optional) how much non-overlapping content there should be in the middle of each window, between the overlapping regions

Returns

A version of the same DataFrame, but with BERT tokens, BERT embeddings for each token (if compute_embeddings is True), and token class labels.
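A usage sketch, assuming doc_df is one DataFrame from conll_2003_to_dataframes(), tokenizer and bert_model are transformers objects as in the add_embeddings() example above, and token_class_dtype comes from make_iob_tag_categories():

from text_extensions_for_pandas.io import bert as tp_bert

bert_df = tp_bert.conll_to_bert(
    doc_df, tokenizer, bert_model, token_class_dtype, compute_embeddings=True
)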

text_extensions_for_pandas.io.bert.make_bert_tokens(target_text: str, tokenizer) pandas.core.frame.DataFrame[source]

Tokenize the indicated text for BERT embeddings and return a DataFrame with one row per token.

Parameters
  • target_text – string to tokenize

  • tokenizer – A tokenizer that is a subclass of huggingface transformers PreTrainedTokenizerFast which supports encode_plus with return_offsets_mapping=True.

Returns

A pd.DataFrame with the following columns:

  • “id”: unique integer ID for each token

  • “span”: span of the token (with offsets measured in characters)

  • “input_id”: integer ID suitable for input to a BERT embedding model

  • “token_type_id”: list of token type IDs to be fed to a model

  • “attention_mask”: mask values indicating which tokens should be attended to by the model

  • “special_tokens_mask”: True if the token is a zero-length special token such as “start of document”

text_extensions_for_pandas.io.bert.seq_to_windows(seq: numpy.ndarray, overlap: int, non_overlap: int) Dict[str, numpy.ndarray][source]

Convert a variable-length sequence into a set of fixed length windows, adding padding as necessary.

Usually this function is used to prepare batches of BERT tokens to feed to a BERT model.

Parameters
  • seq – Original variable length sequence, as a 1D numpy array

  • overlap – How much overlap there should be between adjacent windows

  • non_overlap – How much non-overlapping content there should be in the middle of each window, between the overlapping regions

Returns

Dictionary with the keys “input_ids” and “attention_masks” mapped to NumPy arrays as described below.

  • d[“input_ids”] (where d is the returned dictionary): 2D np.ndarray of fixed-length windows

  • d[“attention_masks”]: 2D np.ndarray of attention masks (1 for tokens that are NOT masked, 0 for tokens that are masked) to feed into your favorite BERT-like embedding generator.
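A small sketch with a stand-in sequence of token IDs:

import numpy as np
from text_extensions_for_pandas.io import bert as tp_bert

input_ids = np.arange(1, 300)   # stand-in for a variable-length sequence of BERT input IDs
windows = tp_bert.seq_to_windows(input_ids, overlap=32, non_overlap=64)
# windows["input_ids"] and windows["attention_masks"] are 2D arrays of fixed-length windows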

text_extensions_for_pandas.io.bert.windows_to_seq(seq: numpy.ndarray, windows: numpy.ndarray, overlap: int, non_overlap: int) numpy.ndarray[source]

Inverse of seq_to_windows(). Convert fixed length windows with padding to a variable-length sequence that matches up with the original sequence from which the windows were computed.

Usually this function is used to convert the outputs of a BERT model back to a format that aligns with the original tokens.

Parameters
  • seq – Original variable length sequence to align with, as a 1D numpy array

  • windows – Windowed data to align with the original sequence. Usually this data is the result of applying a transformation to the output of seq_to_windows().

  • overlap – How much overlap there is between adjacent windows

  • non_overlap – How much non-overlapping content there should be in the middle of each window, between the overlapping regions

Returns

A 1D np.ndarray containing the contents of windows that correspond to the elements of seq.