Split Testing Common Questions
Don't see the answer you're looking for? Get in touch.
A/B Split Test Sample Size Calculator
Use this A/B Split Test Sample Size Calculator!
This easy-to-use tool helps you find out how many visitors you need to test two different versions of your webpage or app.
To use this calculator effectively, input the following details:
Baseline Conversion Rate: Your current conversion rate (e.g., the percentage of visitors who make a purchase).
Minimum Detectable Effect: The smallest change in the conversion rate you want to detect.
Statistical Significance: The confidence level you want for your test results (usually 95%).
Statistical Power: The probability of detecting a true effect (typically set at 80%).
By entering these values, you'll get the recommended sample size for each version of your test, ensuring your results are both accurate and meaningful.
Baseline Conversion Rate (%): Your control group's expected conversion rate.
Minimum Detectable Effect (%): The minimum relative change in conversion rate you would like to be able to detect.
Statistical Significance (%): Typically set at 95%, which corresponds to accepting a 5% risk of a Type I error (false positive).
Statistical Power (%): Typically set at 80%, which corresponds to accepting a 20% risk of a Type II error (false negative).
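The page does not state which formula the calculator uses. As a minimal sketch, assuming a two-sided two-proportion z-test with the usual normal approximation and a relative minimum detectable effect, the four inputs above could be combined as follows. The function name and its parameters are illustrative, not part of the calculator.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde,
                            significance=0.95, power=0.80):
    """Approximate visitors needed per variant (two-sided z-test).

    Assumption: rates are fractions (0.05 means 5%) and the minimum
    detectable effect is relative to the baseline (0.20 means +20%).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)   # rate we want to detect
    alpha = 1 - significance                  # Type I error tolerance

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power

    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # 5% baseline, 20% relative lift (5% -> 6%), 95% significance, 80% power
    print(sample_size_per_variant(0.05, 0.20))
```

With a 5% baseline and a 20% relative effect, this approximation asks for on the order of eight thousand visitors per variant. Halving the detectable effect roughly quadruples the requirement, which is why small effects need long tests.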
Beyond the basic parameters, several advanced considerations can impact the accuracy and reliability of your A/B tests:
Multiple Testing: Running several tests or comparisons at the same time increases the probability of obtaining at least one false positive result. This can be mitigated with techniques such as the Bonferroni correction, which divides the significance level by the number of comparisons.
Non-Uniform Traffic: When traffic is not evenly distributed across the test duration, it can affect the accuracy of the results. Techniques such as stratified sampling can help ensure that the test groups are representative of the overall population, improving the validity of the results.
External Validity: A/B tests may not always generalize to the broader population. Techniques such as propensity scoring can help address this issue by ensuring that the test sample is representative of the target audience, enhancing the external validity of the results.
By considering these advanced factors, you can design more robust A/B tests that provide reliable and generalizable insights.
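The Bonferroni correction mentioned above can be sketched in a few lines. This is an illustrative example only; the helper names are hypothetical and the p-values are made-up inputs.

```python
def bonferroni_alpha(alpha, num_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    if num_comparisons < 1:
        raise ValueError("num_comparisons must be at least 1")
    return alpha / num_comparisons


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag which p-values remain significant after the correction."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p <= threshold for p in p_values]


if __name__ == "__main__":
    # Three simultaneous comparisons: the threshold becomes 0.05 / 3
    print(significant_after_bonferroni([0.01, 0.04, 0.20]))
```

Note that a p-value of 0.04 would pass an uncorrected 5% threshold but not the corrected one. The correction is conservative, so it trades some statistical power for protection against false positives.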
A/B Test Duration and Planning
Planning the duration of an A/B test is a critical component of a successful testing strategy. The test duration should be long enough to detect a statistically significant difference between the control and variation groups but not so long that it becomes impractical or costly.
Key factors to consider include:
Average Daily Visitors: The number of visitors your site receives daily is essential for determining the test duration. A higher number of average daily visitors allows for a shorter test duration, as the required sample size can be reached more quickly.
Test Duration: The length of time needed to detect a statistically significant difference between the control and variation groups. This depends on the average daily visitors and the required sample size.
Sample Size: The number of visitors required to detect a statistically significant difference. This is influenced by the baseline conversion rate, minimum detectable effect, significance level, and statistical power.
Statistical Power: The probability of detecting a statistically significant difference when there is one. Higher statistical power requires a larger sample size and potentially a longer test duration.
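As a rough sketch of how these factors relate, the duration can be estimated by dividing the total required sample by the daily traffic that enters the test. The function name, the `traffic_share` parameter, and the example numbers are assumptions for illustration.

```python
from math import ceil


def estimated_test_days(sample_size_per_variant, average_daily_visitors,
                        num_variants=2, traffic_share=1.0):
    """Estimate how many full days a test needs to reach its sample size.

    traffic_share is the (assumed) fraction of daily visitors that is
    actually included in the test, split evenly across variants.
    """
    visitors_per_day = average_daily_visitors * traffic_share
    if visitors_per_day <= 0:
        raise ValueError("daily test traffic must be positive")
    total_needed = sample_size_per_variant * num_variants
    return ceil(total_needed / visitors_per_day)


if __name__ == "__main__":
    # Hypothetical: 8,200 visitors per variant, 1,000 visitors per day
    print(estimated_test_days(8200, 1000))
```

In practice many teams round the result up to whole weeks so that each day of the week is represented equally, but that is a planning convention rather than something this estimate enforces.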
Type I Error (False Positive)
Definition: A Type I error occurs when you reject the null hypothesis when it is actually true. In other words, you detect an effect that is not actually there.
Example: Imagine you run an A/B test to see if a new webpage design (Version B) increases conversions compared to the current design (Version A). A Type I error would mean concluding that Version B is better when, in reality, it isn’t.
Statistical Significance: This error is controlled by the significance level (α), often set at 0.05 (5%). If your significance level is 5%, there is a 5% chance of making a Type I error when there is no real difference between the versions.
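One way to see what the significance level means is to simulate many A/A tests, where both versions share the same true conversion rate, and count how often a test still reports a significant difference. The sketch below assumes a pooled two-proportion z-test; the function names, traffic figures, and seed are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_false_positive_rate(true_rate=0.10, visitors=500, runs=1000,
                                 alpha=0.05, seed=0):
    """Share of simulated A/A tests that wrongly report a difference."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        # Both versions convert at the same true rate (null hypothesis holds)
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if two_proportion_p_value(conv_a, visitors, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / runs


if __name__ == "__main__":
    print(simulate_false_positive_rate())
```

The printed share should land close to the chosen α, with some run-to-run noise because the number of simulated tests is finite.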
Type II Error (False Negative)
Definition: A Type II error occurs when you fail to reject the null hypothesis when it is actually false. This means you miss detecting an effect that is actually present.
Example: Continuing with the A/B test example, a Type II error would mean concluding that there is no difference between Version A and Version B when, in fact, Version B does increase conversions.
Statistical Power: This error is controlled by the power of the test (1 - β), often set at 0.80 (80%). If your test has 80% power, there is a 20% chance of missing a true effect of the size the test was designed to detect.
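To illustrate how power depends on sample size, the sketch below approximates the power of a two-sided two-proportion z-test for a given number of visitors per variant. It uses the normal approximation and ignores the negligible chance of a significant result in the wrong direction; the function name and example rates are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(p1, p2, n_per_variant, alpha=0.05):
    """Approximate power to detect a change from rate p1 to rate p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Standard error assuming no difference (used for the critical value)
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_variant)
    # Standard error assuming the true rates really are p1 and p2
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    return NormalDist().cdf((abs(p2 - p1) - z_alpha * se_null) / se_alt)


if __name__ == "__main__":
    for n in (2000, 8158, 15000):
        power = approximate_power(0.05, 0.06, n)
        # Type II error rate (beta) is simply 1 - power
        print(n, round(power, 2), round(1 - power, 2))
```

Under these assumptions, a test that stops well short of its planned sample size has far less than 80% power, so a "no difference" result from such a test says little about whether Version B actually helps.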
Key Points to Remember
Type I Error: False positive; detecting a difference when none exists. Controlled by the significance level (α).
Type II Error: False negative; failing to detect a difference when one exists. Controlled by the power of the test (1 - β).
Baseline Conversion Rate: This is the expected conversion rate of your control group without any changes.
Minimum Detectable Effect: The smallest change in conversion rate that you want to detect, expressed as a percentage increase or decrease from the baseline.
Statistical Significance: It helps you determine the likelihood that your results are due to the changes made and not to random chance.
Statistical Power: It indicates the probability that your test will detect a true effect when there is one.