By using decorators and logging together, any property of the dataframe can be written to the log files when it is specified in a decorator. Here, the shape and the columns are returned using log_shape and log_columns. The logging information is also printed below for reference.

    INFO - get_subject_rank,Index(, dtype='object')

Note: wraps is used to eliminate the side effect of decorators, so that the function's name, docstring, argument list, etc. are carried over after the decorators are applied. The code above is modified from Tom's code.

pipe is a flexible method for accommodating customized functions in pandas operations. It is great that pandas has implemented so many methods that enable method chaining during data manipulation.
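The logging decorators described above can be sketched roughly as follows. This is a minimal illustration, not the code from the original post: the names log_shape, log_columns and get_subject_rank come from the text, but their bodies, the sample data and the log format are assumptions.

```python
import logging
from functools import wraps

import pandas as pd

# Assumed log format, chosen to resemble the "INFO - ..." line in the text.
logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


def log_shape(func):
    """Log the shape of the dataframe returned by func."""
    @wraps(func)  # keeps func's name and docstring on the wrapper
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logger.info("%s,%s", func.__name__, result.shape)
        return result
    return wrapper


def log_columns(func):
    """Log the columns of the dataframe returned by func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logger.info("%s,%s", func.__name__, result.columns)
        return result
    return wrapper


@log_shape
@log_columns
def get_subject_rank(input_df):
    """Rank the scores within each subject (hypothetical body)."""
    return input_df.assign(
        rank=input_df.groupby("subject")["score"].rank(ascending=False)
    )


# Made-up sample data for illustration only.
df = pd.DataFrame({"subject": ["math", "math", "art"], "score": [90, 75, 88]})
ranked = df.pipe(get_subject_rank)
```

Because both wrappers use wraps, the stacked decorators still report the name get_subject_rank in the log instead of "wrapper", and the chain itself does not need to change to produce the log lines.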
Consider the two common ways of calling several functions in a row. The first, which uses nested functions heavily, is hard to read without proper formatting, and it is hard to recognize the arguments of each function. From my personal experience, the second, which stores each intermediate result in a variable, is harder for me, because it requires giving meaningful names to the intermediate results; otherwise they are hard to recognize later. Also, the intermediate results are sometimes one-time results that are not used in the later part of the process. That's why I wouldn't choose this way if I had alternatives.

    shipping_info = shipping(new_order, "address")
    billing_info = billing(shipping_info, "credit_card")
    completed_order = place_order(billing_info)

For the same process, method chaining with pipe makes the process readable and makes the arguments of each function call easy to recognize. It clearly shows the sequence of execution and the arguments without the need to nest. This situation is common in data science, where data manipulation involves numerous processes. Oftentimes the intermediate results are not important, since the goal of data manipulation is to get clean final data.

When a function is called in pipe, its first argument is by default the dataframe or series that pipe is applied to. Here is an example of a function that modifies scores, add_score. Its core is the line:

    input_df = input_df.assign(new_score=lambda x: x.score+added_score)

There is no need to specify input_df when calling the function in pipe. Now suppose the two arguments of add_score are swapped with each other. In this case, df is the second argument in the call. Thus, a tuple, (function, "name of the data argument"), is passed to pipe to point out which argument is the data to apply the function to.

Some critics might be concerned that long chains are hard to debug because no intermediate results are returned. No worries! In this post, the author provides a great way to tackle this problem: decorators. A decorator is a function that extends the behavior of the wrapped function without explicitly modifying it.
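The two calling conventions of pipe described above, data as the first argument and data named through a tuple, can be sketched as follows. The function bodies and the sample data are assumptions for illustration; only the name add_score and the assign expression come from the text, and add_score_swapped is a hypothetical variant.

```python
import pandas as pd


# Hypothetical version of add_score: the dataframe is the first argument.
def add_score(input_df, added_score):
    return input_df.assign(new_score=lambda x: x.score + added_score)


# Hypothetical variant with the two arguments swapped.
def add_score_swapped(added_score, input_df):
    return input_df.assign(new_score=lambda x: x.score + added_score)


# Made-up sample data for illustration only.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [70, 85]})

# Default: pipe passes df as the first positional argument.
first = df.pipe(add_score, 5)

# Tuple form: the string names the keyword argument that should receive df.
second = df.pipe((add_score_swapped, "input_df"), 5)
```

Both calls return a new dataframe with a new_score column, and the original df is left unchanged because assign returns a copy.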
pipe is a method of pandas.DataFrame that passes the dataframe to an existing function from a package or to a self-defined function. It is one of the methods that enable method chaining. By using pipe, multiple processes can be combined with method chaining without nesting.

Let's look at an example to show its benefits. In the pandas documentation, three functions, h(df), g(df, arg1=a), and f(df, arg2=b, arg3=c), are applied to df in this order. Usually, the three functions are nested inside one another in the sequence of calling, which makes the functions and arguments hard to read at first glance. With method chaining, the relationships among the operations can be shown in a clearer format.

Let's use online shopping as an example to show a different approach to combining multiple processes in a row. Five functions, add_to_cart, checkout, shipping, billing, and place_order, are used to complete a customer's transaction. There are two common ways of calling multiple functions consecutively: nesting the calls, or storing each intermediate result in a named variable.
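To make the contrast between nesting and pipe concrete, here is a small sketch. The bodies of h, g and f and the sample data are made up for illustration; the pandas documentation only uses them as placeholder names.

```python
import pandas as pd


# Hypothetical stand-ins for h, g and f.
def h(df):
    return df.dropna()


def g(df, arg1):
    return df.assign(total=df["score"] + arg1)


def f(df, arg2, arg3):
    return df[df["total"].between(arg2, arg3)]


df = pd.DataFrame({"score": [50.0, None, 80.0, 95.0]})

# Nested calls: the first function applied sits innermost,
# so the expression has to be read from the inside out.
nested = f(g(h(df), arg1=5), arg2=60, arg3=100)

# pipe: the same steps, read top to bottom in execution order.
piped = (
    df.pipe(h)
      .pipe(g, arg1=5)
      .pipe(f, arg2=60, arg3=100)
)
```

Both versions produce the same dataframe; the chained one keeps each function next to its own arguments.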