Project Business Statistics: E-news Express¶

Marks: 60

Define Problem Statement and Objectives¶

Problem Statement¶

E-news Express delivers news electronically and wants to analyze how users interact with its website in order to increase engagement. The executives believe that the design of the current webpage and its recommended content are attracting fewer new monthly subscribers than in the past year.

Objectives¶

The company has decided to use A/B testing to examine how a different landing page design would affect subscription and user interaction. Through research, a new landing page was created with a different outline and more relevant content. An experiment was conducted, dividing 100 randomly selected users into two groups of 50. The control group viewed the current landing page, and the treatment group viewed the newly created landing page. Their user interaction data was collected. The data needs to be analyzed at a significance level of 5% to decide whether this new landing page is more effective in increasing the number of new subscribers.

Questions that need to be considered in this analysis:

  • Which landing page do users spend more time on?
  • Which landing page has the higher conversion rate?
  • Does preferred language affect converted status?
  • Does time spent on the new landing page remain the same for users with different languages?

Data Dictionary¶

The dataset includes the following information about each user:

  • user_id (unique for each user)
  • group (control or treatment)
  • landing_page (new or old)
  • time_spent_on_the_page (in minutes)
  • converted (Did the user subscribe or not?)
  • language_preferred (selected language by user)

Import all the necessary libraries¶

In [1]:
#Analyzing and Manipulating Data
import numpy as np #importing numpy library for manipulating arrays
import pandas as pd #importing pandas library for manipulating datasets

#Data Visualization
import seaborn as sns #importing seaborn library for data visualization
import matplotlib.pyplot as plt #importing matplotlib.pyplot for data visualization

#Statistics/Probability
import scipy.stats as stats #importing scipy.stats library for statistical/probability functions

Reading the Data into a DataFrame¶

In [2]:
#organizing the data in the abtest.csv file into a dataframe called users
users = pd.read_csv('abtest.csv')

Explore the dataset and extract insights using Exploratory Data Analysis¶

In [3]:
#viewing first 5 rows of the users dataframe
users.head()
Out[3]:
user_id group landing_page time_spent_on_the_page converted language_preferred
0 546592 control old 3.48 no Spanish
1 546468 treatment new 7.13 yes English
2 546462 treatment new 4.40 no Spanish
3 546567 control old 3.02 no French
4 546459 treatment new 4.75 yes Spanish
In [4]:
#viewing last 5 rows of the users dataframe
users.tail()
Out[4]:
user_id group landing_page time_spent_on_the_page converted language_preferred
95 546446 treatment new 5.15 no Spanish
96 546544 control old 6.52 yes English
97 546472 treatment new 7.07 yes Spanish
98 546481 treatment new 6.20 yes Spanish
99 546483 treatment new 5.86 yes English

Observations:¶

There are 6 columns in the users dataframe, and each row represents data regarding the landing page interaction of a user.

In [5]:
#viewing the total number of rows and columns in the users dataframe
users.shape
Out[5]:
(100, 6)

Observations:¶

There are 100 rows and 6 columns in the users dataframe.

In [6]:
#viewing the statistical summary of the numerical variables in the users dataframe
#In this case, the only numerical value considered is time_spent_on_the_page
users.describe()
Out[6]:
user_id time_spent_on_the_page
count 100.000000 100.000000
mean 546517.000000 5.377800
std 52.295779 2.378166
min 546443.000000 0.190000
25% 546467.750000 3.880000
50% 546492.500000 5.415000
75% 546567.250000 7.022500
max 546592.000000 10.710000

Observations:¶

Users spent an average of 5.38 minutes on the landing page. The lowest time spent is 0.19 minutes, and the highest time spent is 10.71 minutes.

In [7]:
#viewing further information about the columns and their datatypes
users.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   user_id                 100 non-null    int64  
 1   group                   100 non-null    object 
 2   landing_page            100 non-null    object 
 3   time_spent_on_the_page  100 non-null    float64
 4   converted               100 non-null    object 
 5   language_preferred      100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB

Observations:¶

  • user_id is an integer data type, but it is only a unique identifier for the user. It is not a quantity, so its summary statistics are not meaningful.
  • group, landing_page, converted, and language_preferred are all object data types, and also categorical variables.
  • time_spent_on_the_page is a float data type, and a numerical variable.
  • There are 100 non-null values in each column.
In [8]:
#checking for any missing values in the users dataframe
users.isnull().sum()  #viewing the sum of null values in each column
Out[8]:
user_id                   0
group                     0
landing_page              0
time_spent_on_the_page    0
converted                 0
language_preferred        0
dtype: int64

Observations:¶

There are no missing values in the users dataframe.

In [9]:
#checking for any duplicate rows in the users dataframe
users.duplicated().sum() #viewing the sum of the duplicated rows in the users dataframe
Out[9]:
0

Observations:¶

There are no duplicate rows in the users dataframe.

Univariate Analysis¶

In [10]:
#making sure that user_id values are unique by checking basic distribution of user_id values
#using seaborn .countplot() function to make user_id bar graph/ color is set to light steel blue
sns.countplot(data=users, x='user_id', color='lightsteelblue'); #data from users dataframe is used to create a bar graph, placing user_id on the x-axis
plt.xticks(rotation=90, size=3); #x-axis labels rotated 90 degrees/size is set to 3 to decrease clutter
plt.title('Basic Distribution of User ID'); #setting title of bar graph
plt.xlabel('User ID'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.show(); #displaying bar graph

Observations:¶

All of the User ID values are unique, and no User ID value is seen more than once in the users dataframe.
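
With one bar per ID the plot is hard to read, so the same check can also be made directly. The snippet below is a minimal sketch that assumes pandas is available; `sample` is a hypothetical stand-in built from the five rows shown by `users.head()`, not the full dataset. On the real data the same attributes would be read from `users['user_id']`.

```python
import pandas as pd

# Hypothetical stand-in for the users dataframe: the five rows shown by users.head()
sample = pd.DataFrame({
    "user_id": [546592, 546468, 546462, 546567, 546459],
    "group": ["control", "treatment", "treatment", "control", "treatment"],
})

# True when no user_id value appears more than once
ids_are_unique = sample["user_id"].is_unique

# Number of distinct IDs; equals the row count when every ID is unique
n_distinct = sample["user_id"].nunique()

print("IDs unique:", ids_are_unique)
print("Distinct IDs:", n_distinct, "of", len(sample), "rows")
```

If the number of distinct IDs equals the number of rows, every user appears exactly once.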

In [11]:
#making sure there are an equal number of users in each group
#using seaborn .countplot() function to make bar graph to view the number of users in each group/color palette is set to use pastel colors
sns.countplot(data=users, x='group', palette='pastel'); #bar graph is created with data from users dataframe, placing group on x-axis
plt.title('Number of Users in Each Group'); #setting title of bar graph
plt.xlabel('Group'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.show(); #displaying bar graph

Observations:¶

  • The two groups are control and treatment.
  • There are 50 users in each group.
In [12]:
#making sure that there are an equal number of users who viewed each landing page
#using seaborn .countplot() function to make bar graph to examine number of users who viewed each landing page/color palette is set to use pastel colors
sns.countplot(data=users, x='landing_page', palette='pastel'); #using data from users dataframe to create bar graph, placing landing_page on x-axis
plt.title('Number of Users Who Viewed Each Landing Page'); #setting title of bar graph
plt.xlabel('Landing Page'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.show(); #displaying bar graph

Observations:¶

  • The two categories are new and old.
  • 50 people viewed each landing page.
In [13]:
#using seaborn .boxplot() function to make boxplot for examining the time users spent on the landing page/color palette is set to use pastel colors
sns.boxplot(data=users, x='time_spent_on_the_page', palette='pastel'); #using data from users dataframe to create boxplot, placing time_spent_on_the_page on the x-axis
plt.title('Time Spent on Landing Page'); #setting title of boxplot
plt.xlabel('Time Spent'); #setting title of x-axis
plt.show(); #displaying boxplot

Observations:¶

  • Time spent on the landing page seems to have no heavy skewness.
  • The median is a little over 5 minutes, and the maximum is a little over 10 minutes.
  • The minimum is much less than one minute.
In [14]:
#using seaborn .countplot() function to create bar graph to examine how many users converted and how many did not/color palette is set to use pastel colors
sns.countplot(data=users, x='converted', palette='pastel'); #using data from users dataframe to create bar graph, placing converted on x-axis
plt.title('Users based on Conversion'); #setting title of bar graph
plt.xlabel('Converted'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.show(); #displaying bar graph

Observations:¶

  • More users converted (54) than did not convert (46).
  • The difference between the two counts is small.
In [15]:
#using seaborn .countplot() function to create bar graph to examine the language preference of users/color palette is set to use pastel colors
sns.countplot(data=users,x='language_preferred', palette='pastel'); #using data from users dataframe to create bar graph, placing language_preferred on x-axis
plt.title('Language Preference of Users'); #setting title of bar graph
plt.xlabel('Language Preferred'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.show(); #displaying bar graph

Observations:¶

  • The three languages are Spanish, French, and English.
  • Preferences are nearly evenly split across the three languages.
  • Spanish and French are each preferred by 34 users, and English by 32.

Bivariate Analysis¶

In [16]:
#using seaborn .countplot() function to create bargraph for comparing group with landing_page
#group (categorical variable) and landing_page (categorical variable)/color palette uses pastel colors
sns.countplot(data=users, x='group', hue='landing_page', palette='pastel'); #using data from users dataframe to create bar graph with group on x-axis/bar colors representing landing_page
plt.title('Group and Landing Page'); #setting title of bar graph
plt.xlabel('Group'); #setting title of x-axis
plt.ylabel('Number of Users'); # setting title of y-axis
plt.legend(title='Landing Page'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • All of the users in the control group viewed the old page.
  • All of the users in the treatment group viewed the new page.
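
The one-to-one mapping between group and landing page can also be confirmed with a cross-tabulation rather than read off the chart. This is a minimal sketch that assumes pandas is available; `sample` is a hypothetical stand-in containing only the ten rows shown by `users.head()` and `users.tail()`, so the counts are smaller than in the full data.

```python
import pandas as pd

# Hypothetical stand-in: (group, landing_page) pairs from the head() and tail() output
rows = [
    ("control", "old"), ("treatment", "new"), ("treatment", "new"),
    ("control", "old"), ("treatment", "new"), ("treatment", "new"),
    ("control", "old"), ("treatment", "new"), ("treatment", "new"),
    ("treatment", "new"),
]
sample = pd.DataFrame(rows, columns=["group", "landing_page"])

# Counts of every group / landing_page combination
table = pd.crosstab(sample["group"], sample["landing_page"])
print(table)

# Rows where the pairing is not control-old or treatment-new
mismatched = sample[(sample["group"] == "control") != (sample["landing_page"] == "old")]
print("Mismatched rows:", len(mismatched))
```

Zero counts in the control/new and treatment/old cells mean the assignment is consistent.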
In [17]:
#using seaborn .boxplot() function to compare group and time_spent_on_the_page
#time_spent_on_the_page (numerical variable) vs. group (categorical variable)/color is set to thistle
sns.boxplot(data=users, x='group', y='time_spent_on_the_page', color='thistle');#using data from users dataframe to create boxplot with group on x-axis and time_spent_on_the_page on y-axis
plt.title('Time Spent on Landing Page vs. Group'); #setting title of boxplot
plt.xlabel('Group'); #setting title of x-axis
plt.ylabel('Time Spent on Landing Page'); #setting title of y-axis
plt.show(); #displaying boxplot

Observations:¶

  • The data in the control group is right-skewed because the median is closer to the bottom(left) end, making the right whisker longer.
  • The median time spent on the landing page in the treatment group is higher than the control group median.
  • Data is more spread out in the control group. There are 3 outliers in the treatment group.
In [18]:
#using seaborn .countplot() function to make bar graph for comparing group with converted
#group (categorical variable) and converted (categorical variable)/color palette is set to use pastel colors
sns.countplot(data=users, x='group', hue='converted', palette='pastel');#using data from users dataframe to create bar graph with group on x-axis/bar colors representing converted
plt.title('Group and Conversion'); #setting title of bar graph
plt.xlabel('Group'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Conversion'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • More users in the control group did not convert to subscribers.
  • More users in the treatment group converted to subscribers.
In [19]:
#using seaborn .countplot() function to make bar graph for comparing group and language_preferred
#group (categorical variable) vs. language_preferred (categorical variable)/color palette is set to use pastel colors
sns.countplot(data=users, x='group', hue='language_preferred', palette='pastel');#using data from users dataframe to create bar graph with group on x-axis/bar colors representing language_preferred
plt.title('Group and Language Preference'); #setting title of bar graph
plt.xlabel('Group'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Language Preference'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • The distribution of language preference is the same in the control and treatment groups.
  • In both groups, there are more users who prefer Spanish and French than English, but it is only a slight difference.
In [20]:
#using seaborn .boxplot() function to make boxplot for comparing landing_page and time_spent_on_the_page
#time_spent_on_the_page (numerical variable) vs. landing_page (categorical variable)/color is set to thistle
sns.boxplot(data=users, x='landing_page', y='time_spent_on_the_page', color='thistle');#using data from users dataframe to create boxplot with landing_page on x-axis and time_spent_on_the_page on y-axis 
plt.title('Time Spent on Landing Page vs. Landing Page'); #setting title of boxplot
plt.xlabel('Landing Page'); #setting title of x-axis
plt.ylabel('Time Spent on Landing Page'); #setting title of y-axis
plt.show(); #displaying boxplot

Observations:¶

  • The data for the old page is right-skewed.
  • The median time spent on the new landing page is higher than the median for the old page, and there are three outliers in new.
  • The data for the old page is more spread out than the data for the new page.
In [21]:
#using seaborn .countplot() function to make bar graph for comparing landing_page and converted
#landing_page (categorical variable) and converted (categorical variable)/color palette is set to use pastel colors
sns.countplot(data=users, x='landing_page', hue='converted', palette='pastel'); #using data from users dataframe to create bar graph with landing_page on x-axis/bar colors representing converted
plt.title('Landing Page and Conversion'); #setting title of bar graph
plt.xlabel('Landing Page'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Conversion'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • More users who viewed the old landing page did not convert to subscribers.
  • More users who viewed the new page converted to subscribers.
In [22]:
#using seaborn .countplot() function to make bar graph for comparing landing_page and language_preferred
#landing_page (categorical variable) and language_preferred (categorical variable)/color palette is set to use pastel colors
sns.countplot(data=users, x='landing_page', hue='language_preferred', palette='pastel');#using data in users dataframe to make bar graph with landing_page on x-axis/bar colors representing language_preferred
plt.title('Landing Page and Language Preference'); #setting title of bar graph
plt.xlabel('Landing Page'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Language Preference'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • The language preference distribution is equal for users who viewed the old page and users who viewed the new page.
  • There are more users who prefer Spanish and French than English in old and new.
In [23]:
#using seaborn .boxplot() function to make boxplot for comparing time_spent_on_the_page and converted
#time_spent_on_the_page (numerical variable) vs. converted (categorical variable)/color is set to thistle
sns.boxplot(data=users, x='converted', y='time_spent_on_the_page', color='thistle'); #using data from users dataframe to make boxplot with converted on x-axis and time_spent_on_the_page on y-axis
plt.title('Time Spent on Landing Page vs. Conversion'); #setting title of boxplot
plt.xlabel('Conversion'); #setting title of x-axis
plt.ylabel('Time Spent on Landing Page'); #setting title of y-axis
plt.show(); #displaying boxplot

Observations:¶

  • The median time spent on the landing page is much higher for users who converted than users who did not convert.
  • The data for users who did not convert is more spread out than the data for users who converted.
In [24]:
#using seaborn .boxplot() function to make box plot for comparing language_preferred and time_spent_on_the_page
#time_spent_on_the_page (numerical variable) vs. language_preferred (categorical variable)/color is set to thistle
sns.boxplot(data=users, x='language_preferred', y='time_spent_on_the_page', color='thistle'); #using data from users dataframe to make boxplot with language_preferred on x-axis and time_spent_on_the_page on y-axis
plt.title('Time Spent on Landing Page vs. Language Preference'); #setting title of boxplot
plt.xlabel('Language Preference'); #setting title of x-axis
plt.ylabel('Time Spent on Landing Page'); #setting title of y-axis
plt.show(); #displaying boxplot

Observations:¶

  • The data for Spanish and English is left-skewed because the median is closer to the top(right) end, making the left whisker longer.
  • The median only slightly varies among the three languages. The median for English is slightly higher than the median for Spanish and French, and the median for French is slightly lower than the median for Spanish and English.
  • The data for English and French is more spread out than the data for Spanish.
In [25]:
#using seaborn .countplot() function to make bar graph for comparing converted and language_preferred
#converted (categorical variable) and language_preferred (categorical variable)/ color palette is set to use pastel colors
sns.countplot(data=users, x='converted', hue='language_preferred', palette='pastel');#using data from users dataframe to make bar graph with converted on x-axis/bar colors representing language_preferred
plt.title('Conversion and Language Preference'); #setting title of bar graph
plt.xlabel('Conversion'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Language Preference') #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • Among users who converted, English was the most common preferred language (21 users).
  • Among users who did not convert, French was the most common preferred language (19 users).

1. Do the users spend more time on the new landing page than the existing landing page?¶

Visual Analysis¶

In [26]:
#visual depiction of time_spent_on_the_page(numerical variable) vs. landing_page (categorical variable)
#using seaborn .boxplot() function to make boxplot/setting color of boxplot to thistle
sns.boxplot(data=users, x='landing_page', y='time_spent_on_the_page', color='thistle'); #using data from users dataframe to make boxplot with landing_page on x-axis and time_spent_on_the_page on y-axis
plt.title('Time Spent on Landing Page vs. Landing Page'); #setting title of boxplot
plt.xlabel('Landing Page'); #setting title of x-axis
plt.ylabel('Time Spent on Landing Page'); #setting title of y-axis
plt.show(); #displaying boxplot

Observations:¶

  • The median time spent on the landing page is much higher for the new page than the old page.
  • The data for the old page is more spread out than the new page data.

Null and Alternate Hypotheses¶

Comparing the mean time spent by the old-page and new-page samples (and checking their standard deviations to choose the right test) lets us either reject or fail to reject the claim that users spend the same average time on both landing pages. If it is rejected, the evidence supports the alternative that users spend more time on the new landing page than on the old landing page.

The mean time spent on the new landing page is $\mu_1$. The mean time spent on the old landing page is $\mu_2$.

Null Hypothesis: Both means are equal.

$H_0: \mu_1=\mu_2$

Alternative Hypothesis: The mean for the new page is greater than the mean for the old page.

$H_a: \mu_1 > \mu_2$

Appropriate Test¶

In [27]:
#checking standard deviation of the time spent on the new page
#adding data for the new landing page into a new dataframe called new_page
new_page = users[users['landing_page'] == 'new']
#using .std() function to calculate standard deviation for the time_spent_on_the_page column of the new_page dataframe
new_std_deviation = new_page['time_spent_on_the_page'].std()

#checking standard deviation of the time spent on the old page
#adding data for the old landing page into a new dataframe called old_page
old_page = users[users['landing_page'] == 'old']
#using .std() function to calculate standard deviation for the time_spent_on_the_page column of the old_page dataframe
old_std_deviation = old_page['time_spent_on_the_page'].std()

#printing both standard deviations
print('Standard Deviation for New Page:', new_std_deviation)
print('Standard Deviation for Old Page:', old_std_deviation)
Standard Deviation for New Page: 1.8170310387878263
Standard Deviation for Old Page: 2.581974849306046

The test that best fits this situation is the two-sample independent t-test with unequal variances (Welch's t-test), because we are comparing the means of two independent samples and the population standard deviations are unknown and cannot be assumed equal.

  • The sample standard deviations differ (about 1.82 vs. 2.58 minutes), and the population standard deviations are not given, so they are unknown and are not assumed to be equal.
  • The two samples (old and new) are not related to each other. They are independent samples because 50 users viewed the old page and 50 users viewed the new page. No user seemed to view both pages.
  • The time spent on the page is continuous data.
  • The users in this experiment were selected randomly.
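
To make the effect of unequal variances concrete, the sketch below computes the standard error of the difference in means and the Welch-Satterthwaite degrees of freedom from the two standard deviations printed above and the group sizes of 50. It uses only the standard library; the variable names are illustrative and are not part of the original notebook.

```python
import math

# Sample standard deviations printed above and the group sizes from the experiment
s_new, s_old = 1.8170310387878263, 2.581974849306046
n_new, n_old = 50, 50

# Variance of each sample mean
v_new = s_new**2 / n_new
v_old = s_old**2 / n_old

# Standard error of the difference in means without pooling the variances
standard_error = math.sqrt(v_new + v_old)

# Welch-Satterthwaite approximation to the degrees of freedom
welch_dof = (v_new + v_old) ** 2 / (v_new**2 / (n_new - 1) + v_old**2 / (n_old - 1))

print(f"standard error of the difference: {standard_error:.4f}")
print(f"approximate degrees of freedom: {welch_dof:.1f}")
```

The approximate degrees of freedom come out at roughly 88, below the 98 that a pooled-variance t-test would use, which is the adjustment that setting `equal_var = False` asks for.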

Significance Level¶

The data is evaluated at a 5% significance level.

$\alpha$ = 0.05

Collecting and Preparing Data¶

In [28]:
#storing just the time_spent_on_the_page column of the new_page dataframe as a Series called time_spent_on_new_page
time_spent_on_new_page = new_page['time_spent_on_the_page']
#storing just the time_spent_on_the_page column of the old_page dataframe as a Series called time_spent_on_old_page
time_spent_on_old_page = old_page['time_spent_on_the_page']

Calculating the p-value¶

In [29]:
#importing the independent t-test function
from scipy.stats import ttest_ind

#using the ttest_ind() function to calculate the p-value
#equal_var is set to False because the variances of the two samples are not assumed to be equal (Welch's t-test)
#alternative is set to greater because the alternative hypothesis is that time spent on the new page is greater than time spent on old page
test_stat, p_value = ttest_ind(time_spent_on_new_page, time_spent_on_old_page, equal_var = False, alternative = 'greater')

print('p-value:', p_value) #printing the p-value
p-value: 0.0001392381225166549

Comparing the p-value with $\alpha$¶

In [30]:
print('α =', 0.05) #printing the alpha value
print('p-value =', p_value) #printing the p-value
α = 0.05
p-value = 0.0001392381225166549

The p-value is much smaller than $\alpha$.

Drawing Inferences¶

Since the p-value is much smaller than $\alpha$, the null hypothesis can be rejected. The alternative hypothesis is supported by this statistical test: on average, users spend more time on the new landing page than on the old landing page.

2. Is the conversion rate (the proportion of users who visit the landing page and get converted) for the new page greater than the conversion rate for the old page?¶

Visual Analysis¶

In [31]:
#visual analysis of conversion and landing page
#using seaborn .countplot() to make bar graph for examining converted and landing_page/color palette is set to use pastel colors
sns.countplot(data=users,x='landing_page', hue='converted', palette='pastel'); #using data from users dataframe to make bar graph with landing_page on x-axis/bar colors representing converted
plt.title('Landing Page and Conversion'); #setting title of bar graph
plt.xlabel('Landing Page'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Conversion'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • There are more users who converted in the new page than the old page.
  • There are more users that did not convert in the old page than the new page.

Null and Alternate Hypotheses¶

The test either rejects or fails to reject the claim that the proportion of users who converted in the new page is the same as the proportion of users who converted in the old page. If it is rejected, the evidence supports that the proportion of users who converted in the new page is greater than the proportion of users who converted in the old page.

The proportion of users who converted in the new page is $p_1$. The proportion of users who converted in the old page is $p_2$.

Null Hypothesis: The proportion of users who converted in the new page and the proportion of users who converted in the old page are equal.

$H_0 : p_1 = p_2$

Alternate Hypothesis: The proportion of users who converted in the new page is greater than the proportion of users who converted in the old page.

$H_a : p_1 > p_2$

Appropriate Test¶

In [32]:
#calculating the number of users who converted in the new page
new = users[users['landing_page'] == 'new'] #adding users in new landing page into another dataframe called new
new_converted = new[new['converted'] == 'yes']#adding the users who converted in the new dataframe into a dataframe called new_converted
# and adding those who did not convert into a frame called new_not_converted
new_not_converted = new[new['converted'] == 'no']

#calculating the number of users who converted in the old page
old = users[users['landing_page'] == 'old'] #adding the users in old landing page into another dataframe called old
old_converted = old[old['converted'] == 'yes']#adding the users who converted in the old dataframe into a dataframe called old_converted
#and adding those that did not convert into a dataframe called old_not_converted
old_not_converted = old[old['converted'] == 'no']

#printing the counts for converted and not converted in the new page and old page
print('Users that converted in the new page:', new_converted['converted'].count())
print('Users that did not convert in the new page:', new_not_converted['converted'].count())
print('Users that converted in the old page:', old_converted['converted'].count())
print('Users that did not convert in the old page:', old_not_converted['converted'].count())

#It is already known that there are a total of 50 users who viewed the new page and 50 users who viewed the old page
Users that converted in the new page: 33
Users that did not convert in the new page: 17
Users that converted in the old page: 21
Users that did not convert in the old page: 29

The test that best fits this situation is the two proportions z-test because we are comparing two proportions from two independent population samples.

  • The samples are both independent and randomly selected.
  • Each user either converts (yes) or does not convert (no), so the number of conversions in each sample follows a binomial distribution.
  • np and n(1-p) for both population samples are equal to or greater than 10.

np and n(1-p) for the proportion of users who converted in the new page:
np = 50 × (33/50) = 33 and 33 > 10
n(1-p) = 50 × ((50-33)/50) = 17 and 17 > 10

np and n(1-p) for the proportion of users who converted in the old page:
np = 50 × (21/50) = 21 and 21 > 10
n(1-p) = 50 × ((50-21)/50) = 29 and 29 > 10

Significance Level¶

The data is evaluated at a 5% significance level.

$\alpha$ = 0.05

Calculating the p-value¶

In [33]:
#importing the proportions z-test function
from statsmodels.stats.proportion import proportions_ztest

#assigning the values for conversion in the new page and old page in numpy array form into a variable called users_converted
users_converted = np.array([33,21])

#assigning the total number of users in the new page and old page in numpy array form into a variable called total_observations
#It has already been established previously that there are 50 users in the new page and 50 in the old page
total_observations = np.array([50,50])

#using the proportions_ztest() function for calculating the p-value
test_stat, p_value = proportions_ztest(users_converted, total_observations, alternative='larger')

print('p-value:', p_value) #printing the p-value
p-value: 0.008026308204056278
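
As a cross-check, the same one-sided p-value can be reproduced by hand from the counts above. The sketch assumes the pooled-proportion form of the two-proportion z-test and uses only the standard library; names such as `p_value_by_hand` are illustrative and are not part of the original notebook.

```python
import math
from statistics import NormalDist

# Conversions and sample sizes for the new and old pages, in that order
converted = (33, 21)
totals = (50, 50)

# Sample conversion rates
p_new = converted[0] / totals[0]
p_old = converted[1] / totals[1]

# Pooled conversion rate under the null hypothesis that both rates are equal
p_pooled = sum(converted) / sum(totals)

# Standard error of the difference in proportions using the pooled rate
std_error = math.sqrt(p_pooled * (1 - p_pooled) * (1 / totals[0] + 1 / totals[1]))

# z statistic and upper-tail p-value for the alternative p_new > p_old
z_stat = (p_new - p_old) / std_error
p_value_by_hand = 1 - NormalDist().cdf(z_stat)

print(f"conversion rates: new = {p_new:.2f}, old = {p_old:.2f}")
print(f"z = {z_stat:.4f}, p-value = {p_value_by_hand:.6f}")
```

The hand-computed value agrees with the p-value printed above, which also makes the underlying conversion rates (0.66 for the new page and 0.42 for the old page) explicit.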

Comparing the p-value with $\alpha$¶

In [34]:
print('α =', 0.05) #printing the alpha value
print('p-value =', p_value) #printing the p-value
α = 0.05
p-value = 0.008026308204056278

The p-value is smaller than $\alpha$.

Drawing Inferences¶

Since the p-value is smaller than $\alpha$, the null hypothesis can be rejected. The alternative hypothesis is supported by this statistical test. The conversion rate for the new page is greater than the conversion rate for the old page.

3. Are conversion and preferred language independent or related?¶

Visual Analysis¶

In [35]:
#visual analysis of conversion and language preference
#using seaborn .countplot() to make bar graph/ color palette is set to use pastel colors
sns.countplot(data=users, x='language_preferred', hue ='converted', palette='pastel');#using data from users dataframe to make bar graph with language_preferred on x-axis/bar colors representing converted
plt.title('Language Preference and Conversion'); #setting title of bar graph
plt.xlabel('Language Preference'); #setting title of x-axis
plt.ylabel('Number of Users'); #setting title of y-axis
plt.legend(title='Conversion'); #setting title of legend
plt.show(); #displaying bar graph

Observations:¶

  • More users that preferred English converted, compared to the users who preferred French and Spanish.
  • More users that preferred French did not convert, compared to the users who preferred English and Spanish.
  • In the users that preferred Spanish, there are more users that converted than users that did not convert.

Null and Alternate Hypotheses¶

The test either rejects or fails to reject the claim that conversion is independent of language preference. If it is rejected, the evidence supports that conversion depends on language preference.

Null Hypothesis

$H_0$ : Conversion is independent of language preference.

Alternate Hypothesis

$H_a$ : Conversion depends on language preference.

Appropriate Test¶

The test that best fits this situation is the chi-square test for independence because we are comparing two categorical variables to examine if they are independent of each other or related to each other.

  • The expected count in every cell of the contingency table is at least 5 (the smallest observed count is 11, and every expected count is well above 5).
  • Both of the variables are categorical variables.
  • The samples were randomly selected.

Significance Level¶

The data is evaluated at a 5% significance level.

$\alpha$ = 0.05

Collecting and Preparing Data¶

In [36]:
#using pd.crosstab() function to create a contingency table with just the converted and language_preferred variables
conversion_languages = pd.crosstab(users.converted, users.language_preferred)

#printing the contingency table
print(conversion_languages)
language_preferred  English  French  Spanish
converted                                   
no                       11      19       16
yes                      21      15       18

Calculating the p-value¶

In [37]:
#importing the chi2_contingency() function
from scipy.stats import chi2_contingency

#using the chi2_contingency() function for calculating the p-value
chi, p_value, dof, expected = chi2_contingency(conversion_languages)

print('p-value:', p_value) #printing the p-value
p-value: 0.21298887487543447
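
As a cross-check, the statistic behind this p-value can be worked out by hand from the contingency table printed above. The sketch below uses only the standard library, and its variable names are illustrative rather than part of the original analysis. It relies on the fact that, for 2 degrees of freedom, the chi-square upper-tail probability reduces to exp(-x/2), so this shortcut does not carry over to tables of other sizes.

```python
import math

# observed counts copied from the contingency table above
# rows: converted = no, yes / columns: English, French, Spanish
observed = [[11, 19, 16],
            [21, 15, 18]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# expected count for each cell under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_stat = sum((o - e) ** 2 / e
                for obs_row, exp_row in zip(observed, expected)
                for o, e in zip(obs_row, exp_row))

# degrees of freedom = (rows - 1) * (columns - 1)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# for dof = 2 only, the chi-square upper-tail probability is exp(-x / 2)
p_value_by_hand = math.exp(-chi2_stat / 2)

print('smallest expected count:', min(min(row) for row in expected))
print('chi-square statistic:', round(chi2_stat, 3))
print('degrees of freedom:', dof)
print('p-value:', round(p_value_by_hand, 4))
```

The smallest expected count is well above 5, which is what the chi-square test needs, and the hand-computed p-value agrees with the chi2_contingency() result after rounding.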

Comparing the p-value with $\alpha$¶

In [38]:
print('α =', 0.05) #printing the value of α
print('p-value =', p_value) #printing the p-value
α = 0.05
p-value = 0.21298887487543447

The p-value is larger than $\alpha$.

Drawing Inferences¶

Since the p-value is larger than $\alpha$, the null hypothesis is not rejected. There is not enough statistical evidence to conclude that conversion depends on language preference.

4. Is the time spent on the new page the same for users with different language preferences?¶

Visual Analysis¶

In [39]:
#creating a copy of the dataframe users, but with just the users who viewed the new page
new_users = users[users['landing_page'] == 'new']

#visual analysis of time_spent_on_the_page (numerical variable) vs. language_preferred (categorical variable)
#using seaborn .boxplot() to make a boxplot from the new_users dataframe/color is set to thistle
#language_preferred is on the x-axis and time_spent_on_the_page is on the y-axis
sns.boxplot(data=new_users, x='language_preferred', y='time_spent_on_the_page', color='thistle');
plt.title('Time Spent on Landing Page vs. Language Preference'); #setting title of boxplot
plt.xlabel('Language Preference'); #setting title of x-axis
plt.ylabel('Time Spent on Landing Page'); #setting title of y-axis
plt.show(); #displaying boxplot

Observations:¶

  • The median time spent on the landing page seems to be the highest for English, and lowest for French.
  • The data is more spread out for English and French than Spanish.

Null and Alternate Hypotheses¶

The test examines whether there is enough evidence to reject the claim that users with different language preferences spent the same mean amount of time on the page. If this claim is rejected, the data supports the conclusion that at least one language preference group spent a different mean amount of time on the page.

The mean time for users with the language preference Spanish is $\mu_1$. The mean time for users with the language preference English is $\mu_2$. The mean time for users with the language preference French is $\mu_3$.

Null Hypothesis: The mean time spent on the page is the same for users with different language preferences.

$H_0 : \mu_1 = \mu_2 = \mu_3$

Alternate Hypothesis: The mean time spent on the page is different for at least one of these language preference groups.

$H_a$ : At least one of $\mu_1$, $\mu_2$, $\mu_3$ is different from the others.

Appropriate Test¶

The test that best fits this situation is the one-way ANOVA F-test because the mean time spent on the page is being compared across three independent groups.

Before performing this test, its assumptions should be checked. Shapiro-Wilk's test is used to examine whether the data for time spent on the page follows a normal distribution, and Levene's test is used to examine whether the variances are equal across the language preference groups.

  • The population samples are independent and selected randomly.

Using Shapiro-Wilk's Test to Examine Normal Distribution¶

Null Hypothesis

$H_0$ : The data for time spent on the page is normally distributed.

Alternate Hypothesis

$H_a$ : The data for time spent on the page is not normally distributed.

In [40]:
#performing shapiro-wilk's test to examine normal distribution
#importing the stats module, which contains the shapiro() function
from scipy import stats

#calculating the p-value for shapiro-wilk's test
w, p_value = stats.shapiro(new_users['time_spent_on_the_page']) #performing the test on the time_spent_on_the_page column of the new_users dataframe

#printing the p-value of shapiro-wilk's test
print("Shapiro-Wilk's Test p-value:", p_value)
Shapiro-Wilk's Test p-value: 0.8040016293525696

The p-value is larger than 0.05, so the null hypothesis cannot be rejected. There is no evidence that the data departs from a normal distribution, so the time spent on the page is treated as approximately normally distributed.
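
The test above is run on the pooled times for all users of the new page. Because the normality assumption of the one-way ANOVA applies to each group, a per-group check is a common complement. The following is a minimal sketch of that idea, not part of the original analysis: demo_users is a hypothetical stand-in filled with synthetic values that only mirrors the column names of new_users, so its p-values say nothing about the study data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# hypothetical stand-in for new_users: synthetic times, NOT the study data
rng = np.random.default_rng(0)
demo_users = pd.DataFrame({
    'language_preferred': np.repeat(['English', 'French', 'Spanish'], 17),
    'time_spent_on_the_page': rng.normal(loc=6, scale=1.5, size=51),
})

# running shapiro-wilk's test separately for each language group
shapiro_p_values = {}
for language, times in demo_users.groupby('language_preferred')['time_spent_on_the_page']:
    w_stat, p_val = stats.shapiro(times)
    shapiro_p_values[language] = p_val

# printing the p-value for each language group
for language, p_val in shapiro_p_values.items():
    print(language, "Shapiro-Wilk's Test p-value:", p_val)
```

Replacing demo_users with new_users would apply the same check to the real data. With small groups, the test has limited power, so a boxplot or Q-Q plot of each group is a useful companion.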

Using Levene's Test to Examine Equal Variance¶

Null Hypothesis

$H_0$ : The variances are equal for each language preference based on time spent on the page.

Alternate Hypothesis

$H_a$ : The variance is not the same for at least one language preference based on time spent on the page.

In [41]:
#performing levene's test to examine equality of variance
#importing the levene() function
from scipy.stats import levene

#calculating the p-value for levene's test
#inputting time_spent_on_the_page column of the new_users dataframe followed by the language_preferred column data specific for each language
statistic, p_value = levene(new_users['time_spent_on_the_page'][new_users['language_preferred'] == 'Spanish'],
                            new_users['time_spent_on_the_page'][new_users['language_preferred'] == 'English'],
                            new_users['time_spent_on_the_page'][new_users['language_preferred'] == 'French'])

#printing the p-value of levene's test
print("Levene's Test p-value:", p_value)
Levene's Test p-value: 0.46711357711340173

The p-value is larger than 0.05, so the null hypothesis cannot be rejected. There is no evidence that the variances differ between the language preference groups, so they are treated as equal.

Neither test found evidence against normal distribution or equal variance, so the one-way ANOVA F-test can be performed.

Significance Level¶

The data is evaluated at a 5% significance level.

$\alpha$ = 0.05

Calculating the p-value¶

In [42]:
#using the one-way ANOVA F-test to calculate the p-value
#importing the f_oneway() function
from scipy.stats import f_oneway

#calculating the p-value with the f_oneway() function
#inputting the time_spent_on_the_page from the new_users dataframe based on data for each language from the language_preferred column
test_stat, p_value = f_oneway(new_users.loc[new_users['language_preferred'] == 'Spanish', 'time_spent_on_the_page'],
                              new_users.loc[new_users['language_preferred'] == 'English', 'time_spent_on_the_page'],
                              new_users.loc[new_users['language_preferred'] == 'French', 'time_spent_on_the_page'])

print('p-value:', p_value)#printing the p-value
p-value: 0.43204138694325955
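
To show what f_oneway() computes, the F statistic is the ratio of the between-group mean square to the within-group mean square, and the p-value is the upper-tail probability of the F distribution with (k - 1, n - k) degrees of freedom. The sketch below works this out by hand and compares it with f_oneway(). The three samples are synthetic, hypothetical values used only for illustration, so the numbers it prints are unrelated to the study result above.

```python
import numpy as np
from scipy.stats import f, f_oneway

# hypothetical samples for three groups: synthetic values, NOT the study data
rng = np.random.default_rng(1)
samples = [rng.normal(loc=6, scale=1.5, size=17) for _ in range(3)]

k = len(samples)                    # number of groups
n = sum(len(s) for s in samples)    # total number of observations
grand_mean = np.concatenate(samples).mean()

# between-group sum of squares: group size * squared distance of group mean from grand mean
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)

# within-group sum of squares: squared distances of observations from their own group mean
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

# F statistic = between-group mean square / within-group mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# p-value = upper-tail probability of the F distribution
p_manual = f.sf(f_manual, k - 1, n - k)

# the same quantities from scipy for comparison
f_scipy, p_scipy = f_oneway(*samples)

print('manual F and p-value:', f_manual, p_manual)
print('f_oneway F and p-value:', f_scipy, p_scipy)
```

A large F statistic means the group means differ by more than the spread within the groups would explain, which leads to a small p-value.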

Comparing the p-value with $\alpha$¶

In [43]:
print('α =', 0.05) #printing the value of alpha
print('p-value =', p_value) #printing the p-value
α = 0.05
p-value = 0.43204138694325955

The p-value is larger than $\alpha$.

Drawing Inferences¶

Since the p-value is much larger than $\alpha$, the null hypothesis is not rejected. There is not enough statistical evidence to conclude that the mean time spent on the new page differs between users with different language preferences.

Conclusion and Business Recommendations¶

Conclusion:¶

  • Users who viewed the new landing page viewed the page longer than users who viewed the old landing page.
  • The time that users spent on the old landing page varied more than the time users spent on the new page.
  • In both groups, Spanish and French were preferred by more users than English.
  • Among the users who converted, English was the most common preferred language.
  • The number of users who converted to subscribers is higher for the new page than the old page.
  • Users that converted to subscribers spent more time on the landing page than users that did not convert to subscribers.
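  • There is not enough statistical evidence to conclude that language preference is related to whether users convert to subscribers.
  • There is not enough statistical evidence to conclude that users with different language preferences spent different amounts of time on the new landing page.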

Business Recommendations:¶

  • More relevant content could be one possible reason why users who viewed the new page viewed it longer than the users who viewed the old page. What a person perceives as relevant content will be different from person to person, so it is important to personalize this content according to the users' interests. Perhaps, there was more variation in the time users spent viewing the old page because the content on the page only aligned with a few users' interests. Other users might not have found it interesting, so they did not view it longer. The new page might have aligned with more users' interests, so there was less variation in the time spent for new users. Further analysis could be performed on the factors that could have contributed to users on the new page viewing it for a longer period of time.
  • In both groups of users, there are more users who preferred French and Spanish, but among the users who converted, there are more users who preferred English. Because the chi-square test did not find a significant relationship between conversion and language preference, this difference may be due to chance or to a factor other than language. It could be the type of content that was seen by these users. If the content was the same for all users, further analysis could help identify what led more English users to subscribe when there were more Spanish and French users present.
  • More users who viewed the new page converted to subscribers than users who viewed the old page. The new page is increasing both the time spent on the page and the subscription rate, which are both beneficial for E-news Express. Users who subscribed spent more time on the page than those who did not subscribe. Increasing interactivity can not only help the company understand its users, but may also increase the conversion rate. Along with relevant and personalized content, adding tools that let users express their opinions and interact with other users can keep them on the page longer, which may increase the chances of them subscribing.