The Wilks Formula is by far the most popular standard of comparison for strength across different weight classes and genders in powerlifting. However, the formula is skewed in favor of super-heavy and super-light lifters. Several prominent figures in the powerlifting community have voiced their concerns about Wilks, including Greg Nuckols (in his article Who’s The Most Impressive Powerlifter?) and Ben Pollack, who commented:
Wilks is, in my opinion, the #1 most unfair thing about modern powerlifting. Notice how there’s no guys on that list (referring to a subset of Greatest Wilks Scores) under 275? There’s a reason for that: Wilks is bullshit and heavily favors heavier lifters. Steve Denison said he’d consider a different metric for the USPA (2nd largest powerlifting federation in the USA) next year and I really hope he follows through on that. 
Robert Wilks and the International Powerlifting Federation (IPF, largest powerlifting federation) have said that they are working on a new formula. However, they have been taking too long. Fortunately, I have already found a solution.
This essay will first examine the Wilks Formula bias, and then propose a better formula.
How Biased is the Wilks Formula?
Below are the all-time top 10 Wilks coefficient scores for raw lifters based on performances at drug-tested meets.  Extreme bodyweights are in red.
|Name||Sex||Bodyweight (lb)||Total (lb)||Wilks|
The #1 lifter, Ray Williams, weighs 402 lb and the #3 lifter, Sergey Fedosienko, is 4’9” tall. Both of these lifters are in classes where there is far less competition, and yet they rank #1 and #3 in the world!
Below is the distribution of bodyweights for competitive lifters at raw drug-tested meets. The bars for the 6 highlighted bodyweights above are very small to practically invisible.
There were only 15 male competitors in the 400–420 lb range where Ray is. In contrast, the 160–180 lb weight range for men hosts the most competitors at 7,871. So a 400 lb man with the best performance out of 15 is equivalent to a 160 lb man with the best performance out of 7,871 according to the Wilks Formula.
Another problem with Wilks is that it gets pretty screwy once the bodyweight gets over 600 lb. A 623 lb man lifting 100 lb will crush all-time records with a 1,400 Wilks score. But he better not weigh in at 624 lb, because then he will receive a Wilks of -282,679. Nobody has yet taken advantage of this in competition, but people weighing over 600 lb certainly exist and they may take advantage of this in the future.
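To see where this blow-up comes from, here is a minimal Python sketch of the men's Wilks calculation. The coefficients are the commonly published ones for the original Wilks formula; they are not given in this essay, so treat them as an assumption. The function names and the pound-to-kilogram constant are also mine. The score divides by a fifth-degree polynomial in bodyweight, and that polynomial shrinks toward zero and then turns negative somewhere past 620 lb, which is what produces the huge positive and then negative scores.

```python
# Men's Wilks coefficients as commonly published (assumption: not from this essay).
WILKS_MEN = (-216.0475144, 16.2606339, -0.002388645,
             -0.00113732, 7.01863e-06, -1.291e-08)
LB_PER_KG = 2.20462262  # Wilks is defined in kilograms


def wilks_denominator(bw_lb):
    """Fifth-degree polynomial in bodyweight (kg) that the Wilks score divides by."""
    x = bw_lb / LB_PER_KG
    return sum(c * x ** i for i, c in enumerate(WILKS_MEN))


def wilks_men(bw_lb, total_lb):
    """Wilks score: total (kg) * 500 / polynomial(bodyweight in kg)."""
    return (total_lb / LB_PER_KG) * 500.0 / wilks_denominator(bw_lb)


if __name__ == "__main__":
    # Watch the denominator collapse toward zero and change sign at extreme bodyweights.
    for bw in (150, 400, 600, 623, 630):
        print(bw, round(wilks_denominator(bw), 1), round(wilks_men(bw, 100), 1))
```

The exact score right at the sign change is extremely sensitive to the unit conversion used, so the value at 624 lb may differ from the one quoted above depending on rounding.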
The cause of these problems is that the formula is modeled on only a handful of elite athletes (those in the IPF from 1987 to 1994), and with such a small sample size, the model is prone to bias.
The simplest and crudest measure of relative strength is the bodyweight multiple, but this measure (total/bodyweight) is skewed in favor of lighter lifters to the point that it’s much worse than Wilks. For example, Lamar Gant deadlifted 661 lb at 132 lb for a 5x multiple, while Eddie Hall – who is arguably the greatest deadlifter of all time – deadlifted 1,102 lb at 440 lb for a 2.5x multiple.
Other alternatives to the Wilks Formula include the Glossbrenner coefficient, Reshel coefficient, NASA coefficient, Schwartz/Malone coefficient and Siff coefficient. However, most of these formulas still use the same approach as Wilks, just calibrated on different data sets, so they maintain many of the same problems as the Wilks and are therefore seldom used.
Greg Nuckols wrote an excellent article with suggestions such as allometric scaling. However, Greg admits that allometric scaling is still problematic at very high bodyweights.
A Better Formula for Relative Strength
The most impressive lifter is the athlete who most drastically outperforms the competition. So if lifter A wins the 140 lb class with a 1,500 lb total versus a 1,000 lb average, and lifter B wins the 250 lb class with a 1,700 lb total versus a 1,600 lb average, then lifter A should win best lifter over lifter B because lifter A was more dominant in their weight class. How do we convert this relationship into a math formula? By calculating which athlete scored the most standard deviations above the mean relative to their bodyweight (i.e. compare z-scores).
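As a rough illustration of the z-score comparison, the sketch below scores the two lifters from the example. The example gives only totals and class averages, so the standard deviation here (250 lb, the same for both classes) is a hypothetical value chosen purely to make the arithmetic concrete.

```python
def z_score(total, class_mean, class_sd):
    """Number of standard deviations a total sits above its class mean."""
    return (total - class_mean) / class_sd


ASSUMED_SD_LB = 250.0  # hypothetical SD, not taken from the data

lifter_a = z_score(1500, 1000, ASSUMED_SD_LB)  # 140 lb class winner
lifter_b = z_score(1700, 1600, ASSUMED_SD_LB)  # 250 lb class winner

if __name__ == "__main__":
    print("Lifter A z-score:", lifter_a)  # 2.0
    print("Lifter B z-score:", lifter_b)  # 0.4
```

Under that assumption lifter A is 2.0 SD above the mean and lifter B only 0.4, so A takes best lifter despite the smaller total.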
On to the Math
Let’s first plot powerlifting totals by bodyweight to get a sense of the shape of the distribution. I also performed a linear regression and 3rd degree fitted polynomial by gender. I tried higher degree polynomials too, but they looked ridiculous because the curve tried too hard to fit the super heavy weights even though there’s not much data there.
You may be thinking, isn’t the problem with Wilks that it’s a fitted polynomial? So why am I trying to fit the distribution to a polynomial? The difference is that the Wilks formula cropped out only the elite performers and fitted a polynomial around them ignoring the rest of the distribution (non-elites). The method I am using fits a polynomial around the averages, and then takes into account the dispersion around the average.
After examining the linear regression compared to the polynomial, we end up using the linear regression anyway. The linear model fits the middle weights just as well, but fits the heavy weights better than the polynomial does. The polynomial has a big advantage in modeling super-light weights, but I examined those data points and a large majority of super-light weight lifters are children, which explains the dip in performance.
You may raise another valid concern: wouldn’t the performances of heavier lifters vary more than lighter lifters since they lift more total weight? That is, do the performances of heavier lifters have higher SD than those of lighter lifters? I calculated the SD for 50 lb segments from 100–400 lb for men, and it turns out that the SD is very consistent. I used 50 lb segments instead of weight classes because weight class ranges vary, which would cause higher SD in weight classes with wider ranges in bodyweights.
|Bodyweight (lb)||SD (lb)|
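The per-segment SD check can be sketched roughly as follows, using only the Python standard library. The function name and the toy records are made up for illustration. This version takes the SD of raw totals inside each 50 lb segment; the analysis in this essay may instead have measured spread around the fitted line, which the text does not specify.

```python
from collections import defaultdict
from statistics import stdev


def sd_by_bin(records, lo=100, hi=400, width=50):
    """Sample SD of totals per bodyweight segment.

    records: iterable of (bodyweight_lb, total_lb) pairs.
    Returns {(segment_start, segment_end): sd}, skipping segments
    with fewer than two lifters (stdev needs at least two values).
    """
    bins = defaultdict(list)
    for bw, total in records:
        if lo <= bw < hi:
            start = lo + int((bw - lo) // width) * width
            bins[(start, start + width)].append(total)
    return {seg: stdev(vals) for seg, vals in sorted(bins.items()) if len(vals) >= 2}


# Toy data, invented for the example (not from openpowerlifting.org).
toy = [(120, 800), (140, 900), (160, 1000), (180, 1100), (190, 900)]
result = sd_by_bin(toy)

if __name__ == "__main__":
    for segment, sd in result.items():
        print(segment, round(sd, 1))
```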
I then did the same analysis for women 100–300 lb and got a peculiar result. The SD for women clearly increases as bodyweight increases. I have no clue why, but I did play around with different models and got very consistent SD across bodyweights after switching to percent lifted relative to the mean, instead of total lifted. I only did this for women since doing this for men would mess up their SD. So the male and female strength formulas are fundamentally quite different.
|Bodyweight (lb)||SD (lb)||SD of %Δ (%)|
One more adjustment. I added 3.5 to everyone’s scores and then multiplied scores by 100 making the new average score 350 instead of 0. I picked 3.5 and 100 because a 150 lb man lifting 1,000 pounds is roughly average in my model and also has a Wilks of ~350. I am not trying to perfectly scale my formula to the Wilks, but I figured it would be easier for people to adjust to a new formula with roughly similar ranges. Also, to be nice, I didn’t want to give half the population negative strength scores.
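Putting the steps so far into symbols, the scoring can be summarized roughly as below. This is my reading of the description rather than a formula stated in the text: $T$ is the lifter's total, $\mu_m(BW)$ and $\mu_f(BW)$ are the fitted linear mean totals for men and women at bodyweight $BW$, $\sigma_m$ is the SD of men's totals in pounds, and $\sigma_{\%}$ is the SD of women's percent deviation from the mean. All of these symbol names are assumptions introduced here.

```latex
z_{\text{male}} = \frac{T - \mu_m(BW)}{\sigma_m},
\qquad
z_{\text{female}} = \frac{T / \mu_f(BW) - 1}{\sigma_{\%}},
\qquad
\text{Score} = 100\,(z + 3.5)
```

With this scaling, $z = 0$ maps to a score of 350 and each additional standard deviation adds 100 points.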
All units are in pounds. A score of 350 is average for the specified bodyweight. One SD is 100.
Males: 0.397958 * Total – 1.31975 * BW + 149.823
Females: ( 316.111 * Total )/( BW + 255.911 ) – 150.865
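For anyone who wants to compute scores directly, here is a minimal Python transcription of the two formulas above. The function names and argument order are my own; the coefficients are copied from the formulas as written, and all inputs are in pounds.

```python
def peidi_male(total_lb, bw_lb):
    """Male score: linear in total and bodyweight (pounds)."""
    return 0.397958 * total_lb - 1.31975 * bw_lb + 149.823


def peidi_female(total_lb, bw_lb):
    """Female score: total scaled by a bodyweight-dependent denominator (pounds)."""
    return 316.111 * total_lb / (bw_lb + 255.911) - 150.865


if __name__ == "__main__":
    # A 150 lb man totaling 1,000 lb should land near the average score of 350.
    print(round(peidi_male(1000, 150), 1))
    # Two women from the top-10 table (bodyweight and total as listed there).
    print(round(peidi_female(1171, 153), 1))
    print(round(peidi_female(1130, 156), 1))
```

The two female examples come out within a point or two of the 753 and 716 listed in the rankings table; the small gap is presumably due to rounding in the published bodyweights and totals. Reading the male coefficient back through the 100-points-per-SD scaling also implies an SD of roughly 251 lb (100 / 0.397958).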
Calculator and Percentile
Feel free to play around with the calculator below. It also calculates a rough percentile relative to the best performances of other drug-tested raw athletes. 
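If the calculator is not available, a rough percentile can be approximated from the score alone. The sketch below assumes, as the footnote on the percentile calculation describes, that scores follow a Gaussian distribution, here with mean 350 and SD 100. The function name is hypothetical, and because the real distribution has fatter tails, the result is only a rough guide at the elite end.

```python
from statistics import NormalDist

# Scores are scaled so that 350 is average and one SD is 100.
SCORE_DIST = NormalDist(mu=350, sigma=100)


def percentile(score):
    """Approximate percentile of a score under a Gaussian assumption."""
    return 100.0 * SCORE_DIST.cdf(score)


if __name__ == "__main__":
    for s in (250, 350, 450, 550):
        print(s, round(percentile(s), 1))
```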
Top 10 Powerlifters by Gender Using the Peidi Formula (Raw, Drug-Tested)
|Name||Peidi Score||Wilks||Wilks Rank (Men Only)||BW (lb)||Total (lb)|
|Name||Peidi Score||Wilks||Wilks Rank (Women Only)||BW (lb)||Total (lb)|
|Ana Rosa Castellain||753||530||10||153||1,171|
|Isabella von Weissenberg||716||505||25||156||1,130|
You may have some objections to the rankings above, which I will address.
1) Elite women overwhelmingly out-score elite men
After examining the data in detail, I have to recommend comparing men’s scores and women’s scores separately. The female distribution has a statistically significant positive skew while the male distribution has close to no skew. This positive skew means that the female distribution has a fatter tail at the elite end. Male and female scores just cannot be compared because the overall distribution shapes are different. The higher elite female scores cannot be fixed with a simple scaling factor, since adjusting the scores of elite athletes would also incorrectly adjust the scores of average athletes.
2) Higher representation of heavy-middle weights and lower representation of light and super-heavy weights compared to Wilks
I believe that this is fine and not an error with the model. First of all, this model calculates out-performance relative to the mean, so you should expect the top 10 elites to be composed of more athletes from larger weight classes (middle weights) – if you take the top 99.9th percentile of each weight class, there will be more athletes from the larger weight classes. This is in contrast to the Wilks formula which attempts to give the same score among elites of different weight classes regardless of the size of those weight classes.
But the largest weight classes are middle weights, so why are the elite athletes according to the Peidi Formula skewed towards heavy weights? Well, this sort of makes sense, since elite athletes generally have much more muscle mass than average athletes, which causes them to gravitate towards higher weight classes. Note that among trained athletes, muscle mass accounts for 60-65% of the variability in strength.  This would also explain the lower representation of light weights, since they are limited in the amount of muscle they can have, which limits their degree of out-performance relative to the mean.
Areas for Improvement
The main area for improvement is that male and female scores are not comparable. I don’t know a good way to fix this. As stated earlier, it’s not as simple as applying a multiple, because the shapes of the distributions are different, so if I make an adjustment to the elites, then the scores of all the non-elite men and women can no longer be compared.
I built my model using only competition data from raw, drug-tested meets without wraps. That’s not to say that my model would be invalid for non-drug-tested results or lifts with wraps, but I would need to do further analysis to confirm.
Raw Data Used
Thanks to openpowerlifting.org for making the data freely available.
A csv of the data used can be downloaded here.
 Ben Pollack’s comments about the Wilks Formula: https://www.reddit.com/r/powerlifting/comments/74r14f/greatest_wilks_scores/do0ksv5/?context=3
 Source: Openpowerlifting.org. Extract date 5/13/2018. Parameters: Raw, All Classes, All Drug-Tested Feds, All Years, All Sexes.
 In powerlifting, “raw” means lifting with minimal supportive equipment (no wraps). Various elastic suits and materials are permitted in “equipped” powerlifting divisions where much heavier weights are lifted.
 Weight classes range from 130–265+ lb for men and 104-185+ lb for women for the majority of meet performances in the data pulled.
 The percentile calculation uses the z-score and assumes a Gaussian distribution. The actual distribution is known to have fatter tails for both genders and a right skew for females.
 Inter-individual variability in the adaptation of human muscle specific tension to progressive resistance training https://link.springer.com/article/10.1007%2Fs00421-010-1601-9
Last time I looked at the top 10 female Wilks scores of every weight class and correlated Wilks to body weight, the relationship was highly linear. Not the case with males. This seems to highlight that the depth of competition (while still growing rapidly) is not yet there for women. I think that explains why you can’t compare your new scoring system’s female scores to males, and possibly highlights its weakness.