ashpool package

Submodules

ashpool.ashpool module

ashpool.dummy module

Module contents

ashpool.attach_temp_id(dframe, field_list=None, id_label=u'tempid', append_uuid=False, prefix=u'')

Attach a column with an ID created from field_list, optionally appending a UUID.

Arguments:
dframe {pandas.DataFrame} – Input dataframe.
Keyword Arguments:

field_list {list} – List of columns to use to build tempid. (default: {None})

append_uuid {bool} – True appends uuid. (default: {False})

Returns:
pandas.DataFrame – Dataframe with tempid built from field_list.
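The idea of building a temporary ID from several fields can be sketched in plain Python. This is only an illustration on a list of dicts, not ashpool's implementation: the helper name make_temp_id and the "_" separator are assumptions, and the real function operates on a pandas.DataFrame.

```python
import uuid


def make_temp_id(row, field_list, append_uuid=False, prefix=""):
    """Join the row's values for field_list into one ID string (hypothetical helper)."""
    parts = [str(row[fld]) for fld in field_list]
    temp_id = prefix + "_".join(parts)  # "_" separator is an assumption
    if append_uuid:
        # A random UUID keeps the ID unique even when field values repeat.
        temp_id += "_" + uuid.uuid4().hex
    return temp_id


rows = [
    {"name": "ann", "city": "oslo"},
    {"name": "bob", "city": "rome"},
]
ids = [make_temp_id(row, ["name", "city"]) for row in rows]
print(ids)  # ['ann_oslo', 'bob_rome']
```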
ashpool.attach_unique_id(dframe, threshold=1.0)

Return a new dataframe based on input dframe with unique fields attached.

Arguments:

dframe {pandas.DataFrame} – Source dataframe

threshold {float} – Required uniqueness, from 0.0 to 1.0 (most unique).

Returns:
pandas.DataFrame – Dataframe with a new id (u_id), which meets the threshold for uniqueness. u_id is based on a combination of non-numeric series, trying to meet the uniqueness threshold with the fewest number of series.
ashpool.best_id_pair(dframe_l, dframe_r, threshold=0.5)

Return df showing which IDs are best for matching two dfs.

Arguments:

dframe_l {pandas.DataFrame} – Left dataframe.

dframe_r {pandas.DataFrame} – Right dataframe.

Keyword Arguments:
threshold {float} – Value between 0 and 1 that represents minimum coveredness (default: {0.5})
Returns:
pandas.DataFrame – Dataframe showing best IDs to use to align source dataframes.
ashpool.check_coveredness(dframe_l, dframe_r)

Returns ratings of coveredness for columns in dframe_l

Arguments:

dframe_l {pandas.DataFrame} – Source dataframe.

dframe_r {pandas.DataFrame} – Target dataframe.

Returns:
pandas.DataFrame – Dataframe showing statistics regarding each column's coveredness.
ashpool.completeness(srs)

Return completeness score for series - i.e., the percentage of non-null values in a series.

Arguments:
srs {pandas.Series}
Returns:
float
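As a rough illustration of the metric, completeness can be computed on a plain list. The helper names are hypothetical, and returning a fraction between 0 and 1 (rather than a value out of 100) is an assumption about the scale.

```python
import math


def is_null(value):
    """Treat None and float NaN as null, roughly as pandas does."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def completeness_sketch(values):
    """Fraction of entries that are not null (hypothetical helper)."""
    if not values:
        return 0.0  # assumption: an empty input scores zero
    non_null = sum(1 for v in values if not is_null(v))
    return non_null / len(values)


print(completeness_sketch([1, None, 3, float("nan")]))  # 0.5
```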
ashpool.coveredness(srs_l, srs_r)

Returns percentage of srs_l members that can be found in srs_r

Arguments:

srs_l {pandas.Series} – Source series.

srs_r {pandas.Series} – Target series.

Returns:
float – Percentage of srs_l members that can be found in srs_r
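A minimal sketch of the coveredness idea on plain lists follows. It assumes that "members" means distinct values and that the result is a fraction between 0 and 1; neither is confirmed by the documentation above, and the helper name is hypothetical.

```python
def coveredness_sketch(left, right):
    """Fraction of distinct values in left that also appear in right."""
    left_members = set(left)
    if not left_members:
        return 0.0  # assumption: nothing to cover scores zero
    right_members = set(right)
    return len(left_members & right_members) / len(left_members)


# "b" and "d" are found on the right, "a" and "c" are not.
print(coveredness_sketch(["a", "b", "c", "d"], ["b", "d", "x"]))  # 0.5
```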
ashpool.cum_uniq(dframe, flds=None)

Return list of incremental uniqueness as tempid is created based on flds.

Arguments:
dframe {pandas.DataFrame} – Source dataframe.
Keyword Arguments:
flds {list} – List of column names used to create tempid. (default: {None})
Returns:
list – List of floats representing incremental addition to uniqueness as more columns are used to create a tempid.
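One way to picture this is to measure uniqueness after each additional field is included in the ID. The sketch below works on a list of dicts and returns the cumulative score at each step; whether ashpool reports cumulative scores or the step-to-step increments is not stated above, so treat the output shape as an assumption.

```python
def cumulative_uniqueness(rows, flds):
    """Uniqueness of an ID built from the first k fields, for k = 1..len(flds)."""
    scores = []
    if not rows:
        return scores
    for k in range(1, len(flds) + 1):
        ids = [tuple(row[f] for f in flds[:k]) for row in rows]
        scores.append(len(set(ids)) / len(ids))
    return scores


rows = [
    {"first": "ann", "last": "lee", "city": "oslo"},
    {"first": "ann", "last": "lee", "city": "rome"},
    {"first": "ann", "last": "kim", "city": "oslo"},
    {"first": "bob", "last": "kim", "city": "oslo"},
]
print(cumulative_uniqueness(rows, ["first", "last", "city"]))  # [0.5, 0.75, 1.0]
```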
ashpool.depiction(srs)

Returns description (depiction) of series.

Arguments:
srs {pandas.Series}
Returns:
pandas.DataFrame
ashpool.differ(dframe_l, dframe_r, left_on, right_on, fields_l=None, fields_r=None, show_diff=False, show_ratio=False, show_data=True, tol_pct=0.0, tol_abs=0.0, depict=False, **kwargs)

Returns dataframe showing comparison between fields_l and fields_r. Dataframes are first aligned using left_on and right_on.

Arguments:

dframe_l {pandas.DataFrame} – Left dataframe

dframe_r {pandas.DataFrame} – right dataframe

left_on {list} – list of series names

right_on {list} – list of series names

Keyword Arguments:

fields_l {list} – List of series names to compare (default: {None})

fields_r {list} – List of series to compare (default: {None})

show_diff {bool} – If true return difference between comparison series (default: {False})

show_ratio {bool} – If true return ratio between comparison series (default: {False})

show_data {bool} – If true return data series in returned results (default: {True})

tol_pct {float} – Tolerance in percentage terms when considering matches in numerical data (default: {0.0})

tol_abs {float} – Tolerance in units when considering matches in numerical data (default: {0.0})

depict {bool} – If true return stats regarding differ results per comparison pair (depiction) (default: {False})

Returns:
pandas.DataFrame – Dataframe showing comparison between fields_l and fields_r.
ashpool.get_combos(lst)

Returns list of combinations of members of list.

Arguments:
lst {list} – List of strings
Returns:
list – List of combinations
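A sketch of generating combinations with the standard library is shown here. It assumes the result covers every non-empty combination of every length, shortest first; the documentation above does not say which lengths are included or in what order.

```python
from itertools import combinations


def all_combos(lst):
    """Every non-empty combination of lst, ordered from shortest to longest."""
    combos = []
    for size in range(1, len(lst) + 1):
        combos.extend(list(c) for c in combinations(lst, size))
    return combos


print(all_combos(["a", "b", "c"]))
# [['a'], ['b'], ['c'], ['a', 'b'], ['a', 'c'], ['b', 'c'], ['a', 'b', 'c']]
```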
ashpool.get_dtypes(dframe)

Return dtypes and kinds by column name (fld).

ashpool.get_most_coveredness(srs_l, dframe_r, top_limit=3)

Returns columns that most cover source series

Arguments:

srs_l {pandas.Series} – Input series.

dframe_r {pandas.DataFrame} – Target dataframe to search.

Keyword Arguments:
top_limit {int} – Maximum number of column names (default: {3})
Returns:
list – List of columns from dframe_r that most cover srs_l.
ashpool.get_sorted_fields(dframe)

Returns list of fields sorted by most_complete, most_unique, and non_object

Arguments:
dframe {pandas.DataFrame} – Input dataframe.
Returns:
dict – Dictionary of lists with fields sorted by completeness and uniqueness. Also includes a list of non_object fields, which are not ranked for completeness or uniqueness.
ashpool.get_unique_fields(dframe, candidate_flds, threshold=1.0, max_member_length=30, show_all=False)

Return list of fields that combine to create an ID that has uniqueness >= threshold.

Arguments:

dframe {pandas.DataFrame} – Input dataframe.

candidate_flds {list} – List of column names.

Keyword Arguments:

threshold {float} – Uniqueness threshold where 1.0 is perfectly unique (default: {1.0})

max_member_length {int} – Used to filter out columns which have members that are too lengthy. (default: {30})

show_all {bool} – Show all results even if uniqueness does not meet threshold. (default: {False})

Returns:
list – List of column names that combine to create a unique ID that meets the uniqueness threshold. Stops looking after finding the first list that meets threshold.
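The "stop at the first list that meets the threshold" behaviour can be sketched as a search over field combinations, smallest first. This is an assumption-laden illustration on a list of dicts: the search order, the helper name, and the empty-list result when nothing qualifies are not taken from ashpool, and max_member_length and show_all are left out.

```python
from itertools import combinations


def first_unique_combo(rows, candidate_flds, threshold=1.0):
    """Return the first field combination whose combined ID meets the threshold."""
    if not rows:
        return []
    for size in range(1, len(candidate_flds) + 1):
        for combo in combinations(candidate_flds, size):
            ids = [tuple(row[f] for f in combo) for row in rows]
            if len(set(ids)) / len(ids) >= threshold:
                return list(combo)  # stop at the first match
    return []  # assumption: empty list when no combination qualifies


rows = [
    {"first": "ann", "last": "lee", "city": "oslo"},
    {"first": "ann", "last": "lee", "city": "rome"},
    {"first": "ann", "last": "kim", "city": "oslo"},
    {"first": "bob", "last": "kim", "city": "oslo"},
]
print(first_unique_combo(rows, ["first", "last", "city"]))        # ['first', 'last', 'city']
print(first_unique_combo(rows, ["first", "last", "city"], 0.75))  # ['first', 'last']
```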
ashpool.has_name_match(srs_l, dframe_r)

Returns True if srs_l name found in dframe_r

Arguments:

srs_l {pandas.Series} – Source series.

dframe_r {pandas.DataFrame} – Dataframe to search.

Returns:
bool – True if srs_l.name is found in dframe_r.columns.
ashpool.jaccard_similarity(srs_l, srs_r)

Returns the Jaccard similarity between two lists.
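For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal plain-Python sketch follows; the helper name and the zero result for two empty inputs are assumptions, not ashpool behaviour.

```python
def jaccard_sketch(left, right):
    """Size of the intersection divided by size of the union of distinct members."""
    left_set, right_set = set(left), set(right)
    union = left_set | right_set
    if not union:
        return 0.0  # assumption: two empty inputs score zero
    return len(left_set & right_set) / len(union)


# Shared: {"b", "c"}; union: {"a", "b", "c", "d"}.
print(jaccard_sketch(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```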

ashpool.leven_dist(x, y)

Returns the Levenshtein distance between two strings, or a null if either argument is not a valid string.

Arguments:

x {str} – First string.

y {str} – Second string.

Returns:
long – Levenshtein distance
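Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The sketch below uses the standard row-by-row dynamic programme and returns None for non-string input to mirror the null described above; it is an illustration, not ashpool's code.

```python
def levenshtein_sketch(x, y):
    """Edit distance between two strings, or None if either is not a string."""
    if not isinstance(x, str) or not isinstance(y, str):
        return None
    # previous[j] holds the distance between the processed prefix of x and y[:j].
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch("kitten", "sitting"))  # 3
```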
ashpool.longest_element(srs)

Return the maximum len() of any element in the series.

Arguments:
srs {pandas.Series}
Returns:
float
ashpool.make_good_label(x_value)

Return something that is a better label.

Arguments:
x_value {str} – String, or something that can be converted to a string.
ashpool.mash(dframe, flds=None, keep_zeros=False)

Returns df of non-null and non-zero on flds

Arguments:
dframe {pandas.DataFrame} – Input dataframe
Keyword Arguments:

flds {list} – List of column names (default: {None})

keep_zeros {bool} – True will keep zeros. (default: {False})

Returns:
pandas.DataFrame – Dataframe with rows removed if null or zero on column[flds].
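The filtering described here can be pictured on a list of dicts. The sketch assumes a row is dropped when any of the listed fields is null or zero; the documentation does not say whether "any" or "all" applies, and the helper name is hypothetical.

```python
def mash_sketch(rows, flds, keep_zeros=False):
    """Keep rows whose values on flds are non-null and, by default, non-zero."""
    kept = []
    for row in rows:
        values = [row.get(f) for f in flds]
        if any(v is None for v in values):
            continue  # drop rows with a null in any listed field
        if not keep_zeros and any(v == 0 for v in values):
            continue  # drop rows with a zero unless keep_zeros is set
        kept.append(row)
    return kept


rows = [
    {"id": "a", "qty": 5},
    {"id": "b", "qty": 0},
    {"id": "c", "qty": None},
]
print(mash_sketch(rows, ["qty"]))                   # [{'id': 'a', 'qty': 5}]
print(mash_sketch(rows, ["qty"], keep_zeros=True))  # rows "a" and "b"
```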
ashpool.oneness(srs_l, srs_r)

TODO

ashpool.rate_series(dframe)

Return ratings of fields for completeness and uniqueness.

Arguments:
dframe {pandas.DataFrame} – Input dataframe.
Returns:
pandas.DataFrame – Dataframe with statistics regarding the quality of columns as identifiers.
ashpool.reconcile(dframe_l, dframe_r, fields_l=None, fields_r=None, show_diff=True, show_ratio=False, show_data=True, tol_pct=0.0, tol_abs=0.0, depict=False, breaks_only=False)

Aligns and compares two dataframes

Arguments:

dframe_l {pandas.DataFrame} – left dataframe

dframe_r {pandas.DataFrame} – right dataframe

Keyword Arguments:

fields_l {list} – list of columns names to compare from dframe_l (default: {None})

fields_r {list} – list of columns names to compare from dframe_r (default: {None})

show_diff {bool} – whether or not to include a calculation of the difference in results (default: {True})

breaks_only {bool} – return only those rows that are not matched (default: {False})

Returns:
pandas.DataFrame – Dataframe showing the results of the comparison.
ashpool.suggest_id_pairs(dframe_l, dframe_r, threshold=0.5, incl_all_dtypes=False, incl_all_pairs=False)

Suggest matching series from two dfs.

Arguments:

dframe_l {pandas.DataFrame} – Left dataframe.

dframe_r {pandas.DataFrame} – Right dataframe.

Keyword Arguments:

threshold {float} – Value between 0 and 1 that represents minimum coveredness. (default: {0.5})

incl_all_dtypes {bool} – Try to use all dtypes (not just object) if True. (default: {False})

incl_all_pairs {bool} – Show all pairs regardless of threshold. (default: {False})

Returns:
pandas.DataFrame – Statistics regarding which pairs of columns to use as IDs and their score (id_scr).
ashpool.uniqueness(srs)

Return uniqueness score for series - i.e., the percentage of unique values in a series (excl. nulls).

Arguments:
srs {pandas.Series}
Returns:
float
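A plain-list sketch of the uniqueness metric, for illustration only: it assumes the score is the number of distinct non-null values divided by the number of non-null values, expressed as a fraction between 0 and 1. The helper name is hypothetical.

```python
import math


def uniqueness_sketch(values):
    """Distinct non-null values divided by the count of non-null values."""
    non_null = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not non_null:
        return 0.0  # assumption: an all-null input scores zero
    return len(set(non_null)) / len(non_null)


# Two distinct values ("a", "b") among three non-null entries.
print(uniqueness_sketch(["a", "a", "b", None]))
```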