ashpool package¶

Subpackages¶

ashpool.tests package

Submodules¶

ashpool.ashpool module¶

ashpool.dummy module¶

Module contents¶

ashpool.attach_temp_id(dframe, field_list=None, id_label=u'tempid', append_uuid=False, prefix=u'')¶

Attach a column with an ID created from field_list, optionally appending a UUID.

Arguments:

dframe {pandas.DataFrame} – Input dataframe.

Keyword Arguments:

field_list {list} – List of columns to use to build tempid. (default: {None})

append_uuid {bool} – True appends uuid. (default: {False})

Returns:

pandas.DataFrame – Dataframe with an ID column (tempid by default) built from field_list.
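The idea of building an ID from several columns can be sketched in plain pandas. This is only an illustration, not ashpool's implementation: the helper name `temp_id_sketch` and the underscore separator are assumptions, and the real function may join values differently.

```python
import uuid

import pandas as pd


def temp_id_sketch(dframe, field_list, id_label="tempid", append_uuid=False, prefix=""):
    """Hypothetical helper: join the string form of field_list columns into one ID."""
    out = dframe.copy()
    # Assumption: values are joined with "_"; ashpool may use another scheme.
    ids = out[field_list].astype(str).apply(lambda row: "_".join(row), axis=1)
    ids = prefix + ids
    if append_uuid:
        # One random UUID per row guarantees uniqueness of the final ID.
        suffix = pd.Series([uuid.uuid4().hex for _ in range(len(out))], index=out.index)
        ids = ids + "_" + suffix
    out[id_label] = ids
    return out


people = pd.DataFrame({"name": ["ann", "bob"], "year": [1990, 1985]})
print(temp_id_sketch(people, ["name", "year"]))
```

With `append_uuid=True` every row receives a distinct suffix, so the resulting ID is unique even when the chosen columns are not.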

ashpool.attach_unique_id(dframe, threshold=1.0)¶

Return a new dataframe based on input dframe with unique fields attached.

Arguments:

dframe {pandas.DataFrame} – Source dataframe

threshold {float} – Required uniqueness, from 0.0 to 1.0 (most unique). (default: {1.0})

Returns:

pandas.DataFrame – Dataframe with a new id (u_id), which meets the threshold for uniqueness. u_id is based on a combination of non-numeric series, trying to meet the uniqueness threshold with the fewest number of series.

ashpool.best_id_pair(dframe_l, dframe_r, threshold=0.5)¶

Return a dataframe showing which IDs are best for matching two dataframes.

Arguments:

dframe_l {pandas.DataFrame} – Left dataframe.

dframe_r {pandas.DataFrame} – Right dataframe.

Keyword Arguments:

threshold {float} – Value between 0 and 1 that represents minimum coveredness (default: {0.5})

Returns:

pandas.DataFrame – Dataframe showing best IDs to use to align source dataframes.

ashpool.check_coveredness(dframe_l, dframe_r)¶

Returns ratings of coveredness for columns in dframe_l

Arguments:

dframe_l {pandas.DataFrame} – Source dataframe.

dframe_r {pandas.DataFrame} – Target dataframe.

Returns:

pandas.DataFrame – Dataframe showing statistics regarding each column's coveredness.

ashpool.completeness(srs)¶

Return completeness score for series - i.e., the percentage of non-null values in a series.

Arguments:

srs {pandas.Series}

Returns:

float
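As a rough illustration of the completeness score, the following sketch computes the share of non-null entries with pandas. The name `completeness_sketch` is hypothetical, and returning a fraction between 0.0 and 1.0 (rather than a value out of 100) is an assumption.

```python
import pandas as pd


def completeness_sketch(srs):
    """Hypothetical helper: fraction of entries in srs that are not null."""
    if len(srs) == 0:
        return 0.0  # assumption: an empty series scores zero
    return float(srs.notna().mean())


scores = pd.Series([10, None, 30, None])
print(completeness_sketch(scores))
```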

ashpool.coveredness(srs_l, srs_r)¶

Returns percentage of srs_l members that can be found in srs_r

Arguments:

srs_l {pandas.Series} – Source series.

srs_r {pandas.Series} – Target series.

Returns:

float – Percentage of srs_l members that can be found in srs_r
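One way to read coveredness is as a membership test of each left value against the right series. The sketch below is an assumption-laden illustration, not the library's code: `coveredness_sketch` is a hypothetical name, nulls are ignored, and the score is computed per row rather than per distinct value.

```python
import pandas as pd


def coveredness_sketch(srs_l, srs_r):
    """Hypothetical helper: fraction of non-null srs_l entries also present in srs_r."""
    members = srs_l.dropna()
    if members.empty:
        return 0.0  # assumption: nothing to cover scores zero
    return float(members.isin(srs_r.dropna()).mean())


left_ids = pd.Series(["a", "b", "c", "d"])
right_ids = pd.Series(["a", "b", "x"])
print(coveredness_sketch(left_ids, right_ids))
```

Note that the measure is not symmetric: swapping the arguments asks how much of the right series is found in the left one.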

ashpool.cum_uniq(dframe, flds=None)¶

Return list of incremental uniqueness as tempid is created based on flds.

Arguments:

dframe {pandas.DataFrame} – Source dataframe.

Keyword Arguments:

flds {list} – List of column names to be used to create the tempid. (default: {None})

Returns:

list – List of floats representing the incremental addition to uniqueness as more columns are used to create a tempid.
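The effect of adding columns one at a time can be sketched as follows. This is a guess at the idea rather than ashpool's algorithm: `cum_uniq_sketch` is a hypothetical name, and it reports the uniqueness reached at each step, so the incremental gain is the difference between consecutive entries.

```python
import pandas as pd


def cum_uniq_sketch(dframe, flds):
    """Hypothetical helper: uniqueness of the first 1, 2, ... columns of flds combined."""
    if len(dframe) == 0:
        return []
    results = []
    for i in range(1, len(flds) + 1):
        # Distinct combinations of the first i columns, relative to the row count.
        distinct_rows = len(dframe.drop_duplicates(subset=flds[:i]))
        results.append(distinct_rows / len(dframe))
    return results


sales = pd.DataFrame({"city": ["A", "A", "B", "B"], "year": [1, 2, 1, 2]})
print(cum_uniq_sketch(sales, ["city", "year"]))
```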

ashpool.depiction(srs)¶

Returns description (depiction) of series.

Arguments:

srs {pandas.Series}

Returns:

pandas.DataFrame

ashpool.differ(dframe_l, dframe_r, left_on, right_on, fields_l=None, fields_r=None, show_diff=False, show_ratio=False, show_data=True, tol_pct=0.0, tol_abs=0.0, depict=False, **kwargs)¶

Returns dataframe showing comparison between fields_l and fields_r. Dataframes are first aligned using left_on and right_on.

Arguments:

dframe_l {pandas.DataFrame} – Left dataframe

dframe_r {pandas.DataFrame} – Right dataframe

left_on {list} – List of series names

right_on {list} – List of series names

Keyword Arguments:

fields_l {list} – List of series names to compare (default: {None})

fields_r {list} – List of series names to compare (default: {None})

show_diff {bool} – If true return difference between comparison series (default: {False})

show_ratio {bool} – If true return ratio between comparison series (default: {False})

show_data {bool} – If true return data series in returned results (default: {True})

tol_pct {float} – Tolerance in percentage terms when considering matches in numerical data (default: {0.0})

tol_abs {float} – Tolerance in units when considering matches in numerical data (default: {0.0})

depict {bool} – If true return stats regarding differ results per comparison pair (depiction) (default: {False})

Returns:

pandas.DataFrame – Dataframe showing comparison between fields_l and fields_r.
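The align-then-compare flow described above can be approximated with a plain pandas merge. Everything in this sketch is an assumption made for illustration: the helper name `align_and_diff_sketch`, the outer join, the `diff` and `match` column names, and the way the absolute tolerance is applied. It does not reproduce the output layout of `differ`.

```python
import pandas as pd


def align_and_diff_sketch(dframe_l, dframe_r, left_on, right_on, field_l, field_r, tol_abs=0.0):
    """Hypothetical helper: align two frames on ID columns and compare one field pair."""
    # Assumption: an outer join keeps rows that exist on only one side.
    aligned = dframe_l.merge(dframe_r, how="outer", left_on=left_on, right_on=right_on)
    aligned["diff"] = aligned[field_l] - aligned[field_r]
    # Rows missing on either side have a NaN diff, which compares as not matching.
    aligned["match"] = aligned["diff"].abs() <= tol_abs
    return aligned


left = pd.DataFrame({"acct": ["a1", "a2", "a3"], "bal": [100.0, 250.0, 75.0]})
right = pd.DataFrame({"account": ["a1", "a2", "a4"], "balance": [100.0, 255.0, 10.0]})
print(align_and_diff_sketch(left, right, ["acct"], ["account"], "bal", "balance", tol_abs=5.0))
```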

ashpool.get_combos(lst)¶

Returns list of combinations of members of list.

Arguments:

lst {list} – List of strings

Returns:

list – List of combinations
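A minimal sketch of generating combinations with the standard library follows. Which combination sizes `get_combos` actually returns is not stated here, so this hypothetical `get_combos_sketch` assumes every non-empty size, smallest first.

```python
from itertools import combinations


def get_combos_sketch(lst):
    """Hypothetical helper: all non-empty combinations of lst, smallest first."""
    combos = []
    for size in range(1, len(lst) + 1):
        combos.extend(combinations(lst, size))
    return combos


for combo in get_combos_sketch(["city", "year", "name"]):
    print(combo)
```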

ashpool.get_dtypes(dframe)¶

Return dtypes and kinds by column names (fld).

ashpool.get_most_coveredness(srs_l, dframe_r, top_limit=3)¶

Returns columns that most cover source series

Arguments:

srs_l {pandas.Series} – Input series.

dframe_r {pandas.DataFrame} – Target dataframe to search.

Keyword Arguments:

top_limit {int} – Maximum number of column names (default: {3})

Returns:

list – List of columns from dframe_r that most cover srs_l.

ashpool.get_sorted_fields(dframe)¶

Returns list of fields sorted by most_complete, most_unique, and non_object

Arguments:

dframe {pandas.DataFrame} – Input dataframe.

Returns:

dict – Dictionary of lists with fields sorted by completeness and uniqueness. Also includes a list of fields which are non_object and are not ranked for completeness or uniqueness.

ashpool.get_unique_fields(dframe, candidate_flds, threshold=1.0, max_member_length=30, show_all=False)¶

Return list of fields that combine to create an ID that has uniqueness >= threshold.

Arguments:

dframe {pandas.DataFrame} – Input dataframe.

candidate_flds {list} – List of column names.

Keyword Arguments:

threshold {float} – Uniqueness threshold where 1.0 is perfectly unique (default: {1.0})

max_member_length {int} – Used to filter out columns which have members that are too lengthy. (default: {30})

show_all {bool} – Show all results even if uniqueness does not meet threshold. (default: {False})

Returns:

list – List of column names that combine to create a unique ID that meets the uniqueness threshold. Stops looking after finding the first list that meets threshold.
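The search for a column combination that meets a uniqueness threshold might look like the sketch below. It is an illustration under assumptions, not the library's algorithm: `unique_fields_sketch` is a hypothetical name, combinations are tried smallest first, and the `max_member_length` and `show_all` options are left out.

```python
from itertools import combinations

import pandas as pd


def unique_fields_sketch(dframe, candidate_flds, threshold=1.0):
    """Hypothetical helper: first column combination whose uniqueness meets threshold."""
    if len(dframe) == 0:
        return []
    for size in range(1, len(candidate_flds) + 1):
        for combo in combinations(candidate_flds, size):
            distinct_rows = len(dframe.drop_duplicates(subset=list(combo)))
            if distinct_rows / len(dframe) >= threshold:
                return list(combo)  # stop at the first combination that qualifies
    return []


staff = pd.DataFrame({
    "first": ["ann", "ann", "bob", "bob"],
    "last": ["lee", "kim", "lee", "kim"],
    "dept": ["x", "x", "x", "x"],
})
print(unique_fields_sketch(staff, ["first", "last", "dept"]))
```

Trying smaller combinations first keeps the resulting ID as short as possible, which matches the stated goal of stopping at the first list that meets the threshold.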

ashpool.has_name_match(srs_l, dframe_r)¶

Returns True if srs_l name found in dframe_r

Arguments:

srs_l {pandas.Series} – Source series.

dframe_r {pandas.DataFrame} – Dataframe to search.

Returns:

bool – True if srs_l.name is found in dframe_r.columns.

ashpool.jaccard_similarity(srs_l, srs_r)¶

Returns the Jaccard similarity between two lists.
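For reference, Jaccard similarity is the size of the intersection divided by the size of the union of the two collections' distinct members. The sketch below uses only the standard library; `jaccard_sketch` is a hypothetical name, and returning 0.0 for two empty inputs is an assumption.

```python
def jaccard_sketch(items_l, items_r):
    """Hypothetical helper: |intersection| / |union| of the distinct members."""
    set_l, set_r = set(items_l), set(items_r)
    union = set_l | set_r
    if not union:
        return 0.0  # assumption: two empty inputs score zero
    return len(set_l & set_r) / len(union)


print(jaccard_sketch(["a", "b", "c"], ["b", "c", "d"]))
```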

ashpool.leven_dist(x, y)¶

Returns the Levenshtein distance between two strings, or a null if either argument is not a valid string.

Arguments:

x {str} – First string.

y {str} – Second string.

Returns:

long – Levenshtein distance
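The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. A standard dynamic-programming sketch follows; `leven_dist_sketch` is a hypothetical name, and returning `None` for non-string input is an assumption about what "null" means here.

```python
def leven_dist_sketch(x, y):
    """Hypothetical helper: edit distance between x and y, or None for non-strings."""
    if not isinstance(x, str) or not isinstance(y, str):
        return None
    # previous[j] holds the distance between the processed prefix of x and y[:j].
    previous = list(range(len(y) + 1))
    for i, char_x in enumerate(x, start=1):
        current = [i]
        for j, char_y in enumerate(y, start=1):
            cost = 0 if char_x == char_y else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(leven_dist_sketch("kitten", "sitting"))
```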

ashpool.longest_element(srs)¶

Returns the maximum len() of any element in the series.

Arguments:

srs {pandas.Series}

Returns:

float

ashpool.make_good_label(x_value)¶

Return something that is a better label.

Arguments:

x_value {string} – A string, or something that can be converted to a string.

ashpool.mash(dframe, flds=None, keep_zeros=False)¶

Returns a dataframe containing only the rows that are non-null and non-zero on flds.

Arguments:

dframe {pandas.DataFrame} – Input dataframe

Keyword Arguments:

flds {list} – List of column names (default: {None})

keep_zeros {bool} – True will keep zeros. (default: {False})

Returns:

pandas.DataFrame – Dataframe with rows removed if null or zero on the flds columns.
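The filtering described here can be approximated with a few pandas calls. This sketch is an assumption-based illustration: `mash_sketch` is a hypothetical name, and it drops a row when any of the listed columns is null or zero, which may differ from the library's rule.

```python
import pandas as pd


def mash_sketch(dframe, flds, keep_zeros=False):
    """Hypothetical helper: drop rows that are null (and optionally zero) on flds."""
    out = dframe.dropna(subset=flds)
    if not keep_zeros:
        # Keep only rows where every listed column is non-zero.
        out = out[(out[flds] != 0).all(axis=1)]
    return out


orders = pd.DataFrame({"qty": [1, 0, None, 4], "price": [2.0, 3.0, 1.0, None]})
print(mash_sketch(orders, ["qty", "price"]))
print(mash_sketch(orders, ["qty", "price"], keep_zeros=True))
```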

ashpool.oneness(srs_l, srs_r)¶: TODO

ashpool.rate_series(dframe)¶

Returns ratings of fields for completeness and uniqueness.

Arguments:

dframe {pandas.DataFrame} – Input dataframe.

Returns:

pandas.DataFrame – Dataframe with statistics regarding the quality of columns as identifiers.

ashpool.reconcile(dframe_l, dframe_r, fields_l=None, fields_r=None, show_diff=True, show_ratio=False, show_data=True, tol_pct=0.0, tol_abs=0.0, depict=False, breaks_only=False)¶

Aligns and compares two dataframes

Arguments:

dframe_l {pandas.DataFrame} – left dataframe

dframe_r {pandas.DataFrame} – right dataframe

Keyword Arguments:

fields_l {list} – List of column names to compare from dframe_l (default: {None})

fields_r {list} – List of column names to compare from dframe_r (default: {None})

show_diff {bool} – whether or not to include a calculation of the difference in results (default: {True})

breaks_only {bool} – Return only those rows that are not matched (default: {False})

Returns:

pandas.DataFrame – Dataframe showing the results of the comparison.

ashpool.suggest_id_pairs(dframe_l, dframe_r, threshold=0.5, incl_all_dtypes=False, incl_all_pairs=False)¶

Suggest matching series from two dataframes.

Arguments:

dframe_l {pandas.DataFrame} – Left dataframe.

dframe_r {pandas.DataFrame} – Right dataframe.

Keyword Arguments:

threshold {float} – Value between 0 and 1 that represents minimum coveredness. (default: {0.5})

incl_all_dtypes {bool} – Try to use all dtypes (not just object) if True. (default: {False})

incl_all_pairs {bool} – Show all pairs regardless of threshold. (default: {False})

Returns:

pandas.DataFrame – Statistics regarding which pairs of columns to use as IDs and their score (id_scr).

ashpool.uniqueness(srs)¶

Returns the uniqueness score for a series, i.e., the percentage of unique values in the series (excl. nulls).

Arguments:

srs {pandas.Series}

Returns:

float
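One plausible reading of the uniqueness score is the number of distinct non-null values divided by the number of non-null values. The sketch below follows that reading; `uniqueness_sketch` is a hypothetical name and the 0.0 to 1.0 scale is an assumption based on the threshold ranges used elsewhere in this module.

```python
import pandas as pd


def uniqueness_sketch(srs):
    """Hypothetical helper: distinct non-null values over total non-null values."""
    non_null = srs.dropna()
    if non_null.empty:
        return 0.0  # assumption: an all-null series scores zero
    return non_null.nunique() / len(non_null)


codes = pd.Series(["a", "a", "b", None])
print(uniqueness_sketch(codes))
```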