Framework for Data Science Solutions

Srijan Bhushan
Jun 5, 2022 · 7 min read


Pandas and Scikit-Learn are Python libraries used for data analysis, machine learning, and building end-to-end software solutions.

There is a framework I like to follow when creating any data science solution for a business problem. A framework is important because it helps you navigate complexity. The steps of the framework are listed below.

Framework:

  • understand the business problem and form a hypothesis
  • decide on a business metric
  • read the data
  • explore and inspect the data
  • pre-process the data
  • define metrics and evaluation
  • model or analyze the data
  • connect the pieces together and automate

If you follow this framework, you will create good data science solutions.

Now, to follow the framework (which is a mental framework), you need tools. Pandas is a tool that helps with this framework. There might be other tools out there that are also well suited, but Pandas is the best one.

Understanding the business problem:

Every solution starts with a problem. In business, if you do not understand the business problem, you end up creating the wrong solution. You might create a technical solution but not a business solution. Thus, we must understand and frame our business problem.

Usually business problems start out very vague. Example: an online real estate website is facing challenges with getting users to book home tours.

When the problem is vague and needs more detailed framing, always start by asking questions to learn more. Example questions: "are there already a good number of users on the website, or is it a new website?", "is there already an existing way for users to book tours on the website that they are aware of?", "is there any metric keeping track of this?"

After asking good questions, let's say you reframe the business problem, but now with much more detail.

Example of the business problem after clarifying details: An online real estate website is facing challenges with getting users to book home tours. The website is five years old and has a good user base. The "book tour" option is shown as a button on the home listing page. The metrics that keep track of this problem are "total tours booked" and "average tours booked per user".

The business problem will either be a prediction problem or a causal inference problem. The two are different and require different approaches, so it is important to understand which one you are dealing with.

Framing a high-level hypothesis:

After the business problem is understood and detailed, the next step is to create a hypothesis for a solution, at a high level. Continuing the example above, our solution hypotheses could look like the ones below:

Hypothesis 1: We can create a model to find user and home activity behavior that is highly correlated with booking a tour.
Hypothesis 2: The low "total tours booked" and "average tours booked per user" numbers can be improved by creating an email campaign.
Hypothesis 3: The low "total tours booked" and "average tours booked per user" numbers can be improved by adding a "book tour" option on the home page.
Hypothesis 4: The low "total tours booked" and "average tours booked per user" numbers can be improved by having a chat bot prompt a home tour.
Hypothesis 5: The low "total tours booked" and "average tours booked per user" numbers can be improved by placing the "book tour" button on the upper right-hand corner of the home listing picture.

Deciding on a metric:

Before you dive too deep into solving the exact problem, it is important to have the north star business metric outlined. In the case above, for example, the business metrics can be "total tours booked" and "average tours booked per user". This is what the business problem is optimizing for. Of course, there will be guardrail metrics along with this, such as "daily active users".

Reading data:

Reading in data is most of the time the very first step in a solution. You will be reading in a lot of data. It could be csv files, text files, parquet files, etc.

import pandas as pd

# csv files
dataframe = pd.read_csv("file_path/data_file.csv", sep=",")
dataframe = pd.read_csv("file_path/data_file.csv", sep="|")
# text files
dataframe = pd.read_csv("file_path/data_file.txt", sep="\t")
# json files
dataframe = pd.read_json("file_path/data_file.json")
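
Parquet files, also mentioned above, can be read the same way; a minimal sketch, assuming a parquet engine such as pyarrow is installed:

# parquet files
dataframe = pd.read_parquet("file_path/data_file.parquet")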

Exploring and Inspecting data:

Check distribution of data, high level:

dataframe.describe()

Visualizing distribution of data:

import matplotlib.pyplot as plt

plt.hist(dataframe["column1"])
plt.show()
Histogram plot

y-axis
^
|      ____
|   __|    |
|  |  |    |
|__|  |    |___
|__|__|____|___|____> x-axis
Check how the distribution looks. Is it normal? Is it bi-modal? Does it have extreme outliers? Depending on that, your further analysis will need to adjust.

Scatter plot:

plt.scatter(dataframe["column1"], dataframe["column2"])
plt.show()
y-axis
^
|              ' '  ' '
|        ' '  ' '
|     ' '
|   ''
|'__________________> x-axis
What relationship do you see between the two variables? Is the correlation linear? Or is it non-linear? Is there a visible pattern? Depending on that, your further analysis will need to adjust.

Preprocessing data:

After reading the data, we need to pre-process or clean it. This requires first understanding the shape and high-level details of the data and then performing changes accordingly.
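
A minimal sketch of that first look at shape and high-level details:

dataframe.shape  # (number of rows, number of columns)
dataframe.info()  # column names, data types, non-null counts
dataframe.head()  # first few rows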

Here are a few examples of pre-processing and cleaning:

Filtering data:

When you need to filter and drop rows that do not meet a condition:

dataframe_filtered = dataframe[dataframe["column_1"] < 2000]

When you need to filter and impute values that do not meet a condition:

dataframe_filtered = dataframe.where(dataframe["column_1"] < 2000, other=0)

A little more swiss-army-knife, imputing with a dynamic calculation:

dataframe_filtered = dataframe.where(dataframe["column_1"] < 2000, other=lambda x: x * 2)

Change data for specific rows that meet a condition:

When you need to change values for specific rows that meet a condition, loc (a way to index rows) comes in handy.

dataframe.loc[dataframe['column1'] > 10, 'column1'] = 0
dataframe.loc[dataframe['column1'] < 10, 'column1'] = 1

Alternatively, you can use np.where() to change values for rows that meet a specific condition. This is more useful as it also lets you provide values for rows that do not meet the condition.

import numpy as np
dataframe['column1'] = np.where(dataframe['column1'] < 10, 1, dataframe['column1'])

Change data type of columns:

dataframe["column_1"] = dataframe["column_1"].astype(str)
dataframe["column_1"] = dataframe["column_1"].astype(int)

Change column names:

If you want to change the names of the columns in the dataframe:

dataframe.columns = ["column_name1", "column_name2", ...]

Denormalization

Denormalization is a big theme in pre-processing.

Joining different dataframes (merge joins on matching columns):

dataframe1.merge(dataframe2, how="inner", on="column1")
dataframe1.merge(dataframe2, how="inner", left_on="column1", right_on="column2", suffixes=("", "_y"))

Concatenating

pd.concat([dataframe1, dataframe2], axis=0)  # vertically concatenate (stack rows)
pd.concat([dataframe1, dataframe2], axis=1)  # horizontally concatenate (side by side)

Aggregation

dataframe1.groupby('column1', as_index=False).agg({'column2': 'sum', 'column3': 'mean'})

Bucketing Columns

You can use loc to create buckets for a column based on certain conditions, as sketched below.
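
A minimal sketch of that idea; the bucket boundaries and the "column1_bucket" column name are made up for illustration:

dataframe.loc[dataframe['column1'] < 100, 'column1_bucket'] = 'low'  # hypothetical boundary
dataframe.loc[(dataframe['column1'] >= 100) & (dataframe['column1'] < 500), 'column1_bucket'] = 'medium'
dataframe.loc[dataframe['column1'] >= 500, 'column1_bucket'] = 'high'

pd.cut() is an alternative that creates the same kind of buckets in a single call.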

Handling Null values

dataframe.fillna(0)  # fill all NA values with 0
dataframe.dropna()  # drop rows with NA values

Model or analyze

Simple correlation:

dataframe['column1'].corr(dataframe['column2'])
dataframe['column1'].corr(dataframe['column2'], method="pearson")

Linear Regression:

import numpy as np
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(
    np.array(dataframe['columns1234']).reshape(-1, 1),
    np.array(dataframe['column5']).reshape(-1, 1))
model.coef_
# output: array([[0.99569654]])

Logistic Regression:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(
    np.array(dataframe['columns1234']).reshape(-1, 1),
    np.array(dataframe['column5']))  # column5 MUST be binary
model.coef_
# output: array([[0.99569654]])

Decision Tree:

from sklearn import tree
model = tree.DecisionTreeClassifier().fit(
    np.array(dataframe['columns1234']).reshape(-1, 1),
    np.array(dataframe['column5']))
model.predict(np.array(dataframe['columns1234']).reshape(-1, 1))
model.predict_proba(np.array(dataframe['columns1234']).reshape(-1, 1))

Metrics and Evaluation

Be careful with metrics; metrics guide you. There are main metrics and there are guardrail metrics.

Low level, offline model metrics:
- Numeric data: RMSE, MAE, R2 score.
- Classification label data: Accuracy, Precision, Recall, AUC, F-score.

Precision is used when the cost of missing an actual 1 is not that critical, but the 1s you do predict need to be accurate. Recall is used when missing an actual 1 is very critical. For these metrics, the sklearn.metrics package offers many popular implementations such as accuracy, precision, recall, RMSE, MAE, and R2 score (a sketch using this package follows the examples below).

High level business metrics:
Acquisition, Engagement, Retention, Revenue. Imagine the customer funnel. Also, cohorts are very important when calculating these metrics.

Examples, taking an online real estate website as the business case:

Acquisition metrics examples:
- Email Invite Acceptance Rate = Number of sign ups / Number of invites sent (cohort analysis)
- Acceptance Rate = Number of sign ups / Number of invites sent

Engagement metrics examples:
- Average Sessions Per User
- Average Session Time Per User
- Average Home Views Per User
- Average Home Tours Per User
- Average Active Home Sales Per Month
- Average Home Views
- Average Home Sales

Retention metrics examples:
- Daily Active: Users, Homes, Home Owners
- Monthly Active: Users, Homes, Home Owners
- Yearly Active: Users, Homes, Home Owners

Revenue metrics examples:
- Total Revenue
- Total Monthly Revenue
- Average Revenue Per Home Sale
- Average Revenue Per User
- Average Revenue Per Region
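
For the offline model metrics mentioned above, here is a minimal sketch using sklearn.metrics; the y_true / y_pred arrays are made-up placeholders:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# classification labels (placeholder values)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
accuracy_score(y_true, y_pred)
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
f1_score(y_true, y_pred)

# numeric predictions (placeholder values)
y_true_num = [100, 150, 200]
y_pred_num = [110, 140, 195]
mean_squared_error(y_true_num, y_pred_num) ** 0.5  # RMSE
mean_absolute_error(y_true_num, y_pred_num)  # MAE
r2_score(y_true_num, y_pred_num)  # R2 score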

Other forms of evaluation include survival analysis:
Survival analysis helps you understand the distribution of "initiation -> event" times for something. It is useful not for finding whether, but for finding how long something takes to converge to an event.
Example: how much time does a car tire take to reach its half-life?
% of samples - y-axis
^
80% |                     '
    |
    |
 5% |     '     '
    |     '     '
    |__1____2____3____4____5____> time to half life (years) - x-axis

If you analyze the chart above, you see that the majority of the tires reach their half-life at year 5. You can calculate the area under the curve for specific ranges to get the probability of the event falling within that range.
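
A minimal sketch of this kind of analysis with Pandas; the "years_to_half_life" column and its values are made-up placeholders for observed time-to-event data:

import pandas as pd

# placeholder data: observed time (in years) for each tire to reach half-life
tires = pd.DataFrame({"years_to_half_life": [5.0, 5.1, 4.8, 5.2, 1.5, 2.0, 4.9, 3.0]})

# distribution of time-to-event, similar to the chart above
tires["years_to_half_life"].hist(bins=5)

# share of tires reaching half-life within a given range, e.g. between 4 and 6 years
in_range = tires["years_to_half_life"].between(4, 6)
in_range.mean()  # fraction of samples falling in that range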

Automation

After you have the different parts of your solution ready, you should start thinking of ways to bind those parts together and automate them. You need a job orchestration tool like Airflow where you can define the flow of execution.
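
As a rough sketch (not the author's actual pipeline), an Airflow DAG tying the steps together might look like the following; the task functions and DAG name are hypothetical placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def read_data():
    ...  # hypothetical placeholder: read csv/parquet files with pandas

def preprocess_data():
    ...  # hypothetical placeholder: filtering, imputation, joins, aggregation

def train_model():
    ...  # hypothetical placeholder: fit the scikit-learn model and log metrics

with DAG(
    dag_id="data_science_solution",  # hypothetical name
    start_date=datetime(2022, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    read = PythonOperator(task_id="read_data", python_callable=read_data)
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data)
    train = PythonOperator(task_id="train_model", python_callable=train_model)

    read >> preprocess >> train  # flow of execution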
