Navigating Private Equity Data: What the Main Datasets Actually Contain
This note summarizes my paper Navigating Private Equity Data: A Critical Review of Key Sources, which provides an overview of the data used in academic research on private equity, their coverage, and their limitations.
The goal of the paper is not to criticize any single provider. Instead, it aims to clarify what researchers can and cannot do with the available data, and to document the trade-offs involved when choosing between different sources.
Fund-level commercial datasets
The two most commonly used datasets for identifying private market funds are Preqin and PitchBook. Both provide large lists of funds with information on strategy, vintage year, and fund size. At first glance, these datasets appear extremely comprehensive, with over 100,000 funds each.
However, once basic filters are applied, the usable sample shrinks quickly. Restricting attention to closed-ended, commingled funds above a minimum size reduces the number of observations by roughly half. Focusing on mature funds reduces it further.
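The shrinking-sample logic can be sketched as a simple filtering funnel. This is a hypothetical illustration: the field names and fund records below are invented, not the actual Preqin or PitchBook schema.

```python
# Toy fund universe; fields (structure, commingled, size_musd, vintage,
# has_perf, has_cf) are illustrative, not a real provider schema.
funds = [
    {"name": "Fund A", "structure": "closed-end", "commingled": True,
     "size_musd": 500, "vintage": 2008, "has_perf": True, "has_cf": True},
    {"name": "Fund B", "structure": "open-end", "commingled": True,
     "size_musd": 900, "vintage": 2010, "has_perf": True, "has_cf": False},
    {"name": "Fund C", "structure": "closed-end", "commingled": False,
     "size_musd": 50, "vintage": 2012, "has_perf": False, "has_cf": False},
    {"name": "Fund D", "structure": "closed-end", "commingled": True,
     "size_musd": 300, "vintage": 2005, "has_perf": True, "has_cf": False},
]

def usable(f, min_size=100, max_vintage=2013):
    # Closed-end, commingled, above a size floor, and mature (old vintage).
    return (f["structure"] == "closed-end" and f["commingled"]
            and f["size_musd"] >= min_size and f["vintage"] <= max_vintage)

base = [f for f in funds if usable(f)]           # basic filters applied
with_perf = [f for f in base if f["has_perf"]]   # summary IRR/TVPI reported
with_cf = [f for f in base if f["has_cf"]]       # full cash flows reported
```

Each successive filter operates on the output of the previous one, which is why the final cash-flow sample is so much smaller than the headline fund count.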
The sharpest reduction occurs when performance data is required. Only a subset of funds reports summary performance measures such as the internal rate of return (IRR) and total value to paid-in (TVPI), and an even smaller subset reports detailed cash flows. For Preqin, roughly 15,000 mature funds have summary performance metrics, and only about 5,000 have fund-level cash flows.
These cash flows are essential for computing NPV-based measures and public market equivalents (PMEs), and for studying NAV dynamics over time. As a result, many questions in private equity research can only be answered on relatively small subsamples.
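To see why cash flows (and not just summary metrics) are needed, here is a minimal sketch of the Kaplan-Schoar PME: each call and distribution is compounded at the benchmark index return to the NAV date, and the PME is the ratio of grown outflows to grown inflows. The cash-flow amounts and index levels below are made up for illustration.

```python
def ks_pme(flows, final_nav, terminal_index):
    """Kaplan-Schoar PME sketch.

    flows: list of (amount, index_level_on_flow_date);
           amount < 0 is a capital call, amount > 0 a distribution.
    final_nav: residual NAV on the terminal date.
    terminal_index: benchmark index level on the terminal (NAV) date.
    """
    # Grow each flow at the index return from its date to the NAV date.
    fv = lambda amount, level: amount * terminal_index / level
    calls = sum(fv(-a, lvl) for a, lvl in flows if a < 0)
    dists = sum(fv(a, lvl) for a, lvl in flows if a > 0)
    return (dists + final_nav) / calls

# Invented example: one call, two distributions, residual NAV of 20.
pme = ks_pme([(-100, 100), (60, 120), (80, 150)], final_nav=20,
             terminal_index=150)
```

The key point is that every term requires a dated cash flow matched to an index level; summary IRR and TVPI figures are not enough to compute this.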
PitchBook provides similar summary performance metrics, but its data access restrictions make it less practical for large-scale academic research. Matching funds across Preqin and PitchBook reveals that only about half of funds can be matched by name, and that for around 6% of matched funds, reported performance metrics differ. Some differences are due to reporting dates, but the existence of discrepancies is itself informative.
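Matching funds by name across providers is itself a data-cleaning exercise. A hedged sketch of one common approach, using only the standard library: strip legal-form suffixes and punctuation, then apply a string-similarity cutoff. The fund names are invented, and real matching of this kind still requires manual review.

```python
import difflib
import re

def normalize(name):
    """Lowercase, drop common legal-form tokens, strip punctuation."""
    name = name.lower()
    name = re.sub(r"\b(l\.?p\.?|fund|partners?)\b", "", name)
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    return " ".join(name.split())

def match(name, candidates, cutoff=0.85):
    """Return the best candidate above the similarity cutoff, else None."""
    norm = {normalize(c): c for c in candidates}
    hits = difflib.get_close_matches(normalize(name), list(norm),
                                     n=1, cutoff=cutoff)
    return norm[hits[0]] if hits else None

best = match("Acme Capital Partners Fund III, L.P.",
             ["Acme Capital III", "Beta Ventures II"])
```

Even with normalization, roughly half of funds going unmatched (as in the Preqin-PitchBook comparison above) is plausible: providers abbreviate, translate, and number fund names differently.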
Bias and representativeness
A recurring concern in the literature is whether funds with reported cash flows differ systematically from those without. Earlier work documented a positive bias when moving from summary statistics to cash-flow samples.
Using current data, the paper shows that this bias appears limited in recent vintages. Funds with cash-flow data in Preqin have performance statistics that are very similar to, and in some cases slightly lower than, those of funds without cash flows. This does not eliminate selection concerns, but suggests they may be smaller than previously feared.
An important benchmark dataset is MSCI-Burgiss, which sources data directly from a broad set of limited partners and provides cash flows for all included funds. It covers fewer funds than Preqin, but its cash-flow availability is substantially greater.
Comparisons between Preqin and MSCI-Burgiss show that TVPI statistics are remarkably similar across the two datasets, while IRRs tend to be higher in Preqin. The paper emphasizes that IRR is a noisy and fragile statistic, and that differences in IRRs should be interpreted with caution.
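The fragility of IRR relative to TVPI is easy to demonstrate with a toy example (numbers invented): two funds that return exactly the same multiple can show very different IRRs depending on the timing of distributions. The IRR here is solved by bisection on the NPV.

```python
def irr(flows):
    """Annual IRR via bisection; flows[t] is the net cash flow in year t.
    Assumes a single sign change (flows[0] < 0), so NPV is monotone in r."""
    def npv(r):
        return sum(cf / (1 + r) ** t for t, cf in enumerate(flows))
    lo, hi = -0.99, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid   # root lies at a higher rate
        else:
            hi = mid
    return (lo + hi) / 2

# Both funds invest 100 and return 200 in total: TVPI = 2.0x for each.
early = [-100, 150, 0, 0, 0, 50]   # most cash returned in year 1
late  = [-100, 0, 0, 0, 0, 200]    # everything returned in year 5
```

The early-distributing fund's IRR exceeds 50% while the late one's is about 15%, even though an investor's money multiple is identical. This is one reason the TVPI comparisons across Preqin and MSCI-Burgiss line up while the IRR comparisons do not.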
A major drawback of MSCI-Burgiss is that it is anonymous and difficult to merge with other datasets. In addition, researchers must submit code to be run by the data provider, which limits flexibility.
Investor-level data
Preqin has expanded its coverage of limited partners and now provides investor identities for tens of thousands of funds. In principle, this enables research at the LP–fund level.
In practice, coverage is uneven. Disclosure is relatively complete for investors subject to freedom-of-information rules or regulatory reporting, such as public pension funds and some insurers. It is highly incomplete for private investors such as family offices and private endowments.
Filtering the data to retain only investors with a meaningful number of fund commitments dramatically reduces the sample. This highlights the importance of careful data cleaning and realistic expectations about representativeness.
Fund terms and conditions
Preqin also provides data on headline fund terms, including management fees, carry rates, hurdle rates, investment periods, and GP commitments. This is the most comprehensive dataset available on fund terms.
That said, many fields are sparsely populated, and some entries are clearly erroneous. More importantly, headline terms vary little across funds within a given strategy and geography. The paper stresses that fee outcomes depend heavily on definitions, exceptions, and contractual details that are not captured in structured datasets.
As a result, these data are more useful for descriptive work than for explaining variation in realized fees.
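A toy calculation shows why identical headline terms can hide large differences in realized fees: the same "2% management fee" yields very different dollar amounts depending on whether the fee basis steps down from committed to invested capital after the investment period. The numbers and the step-down convention below are assumptions for illustration, not figures from the paper.

```python
def mgmt_fees(committed, invested_by_year, rate=0.02, inv_period=5,
              basis="committed"):
    """Total management fees over the fund's life under two fee-basis rules.

    committed: committed capital ($m); invested_by_year: invested cost path.
    basis="committed": fee on committed capital in every year.
    basis="invested": fee steps down to invested cost after inv_period.
    """
    total = 0.0
    for year, invested in enumerate(invested_by_year, start=1):
        if year <= inv_period or basis == "committed":
            total += rate * committed
        else:
            total += rate * invested  # step-down to remaining invested cost
    return total

invested = [40, 80, 100, 100, 100, 80, 60, 40, 20, 10]  # $m, invented path
fee_committed = mgmt_fees(100, invested, basis="committed")
fee_invested = mgmt_fees(100, invested, basis="invested")
```

Here the step-down version charges roughly 30% less in total fees than the flat version, even though both would appear in a structured dataset as an identical "2% management fee".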
Deal-level datasets
Deal-level data offers a different perspective. Early datasets assembled from private placement memoranda enabled the first studies of deal performance, but were small and time-consuming to construct.
More recent datasets, notably MSCI-Burgiss, StepStone, and Unigestion, provide large samples of deal-level performance, including entry valuations, leverage, duration, and exit routes. These datasets confirm many earlier findings on performance persistence and value creation.
However, deal-level data comes with its own limitations. It is constructed from GP reports to LPs. Variables such as EBITDA, entry multiples, and leverage involve discretion and are subject to incentives. EBITDA has no formal accounting definition, and auditors are hired by the GP. The paper documents how reporting incentives can affect these figures.
CEPRES provides an important alternative by offering deal-level cash flows, which allow the computation of PMEs and MIRRs. It is international and less US-centric, but still relies on voluntary reporting.
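Deal-level cash flows are what make an MIRR computable. A brief sketch of the modified IRR, with invented cash flows: distributions are compounded forward at a fixed reinvestment rate and investments discounted at a financing rate, removing the IRR's implicit assumption that interim distributions are reinvested at the IRR itself.

```python
def mirr(flows, finance_rate=0.08, reinvest_rate=0.08):
    """Modified IRR; flows[t] is the net cash flow in year t (flows[0] < 0).

    Positive flows are grown to the final date at reinvest_rate;
    negative flows are discounted to t=0 at finance_rate.
    """
    n = len(flows) - 1
    fv_pos = sum(cf * (1 + reinvest_rate) ** (n - t)
                 for t, cf in enumerate(flows) if cf > 0)
    pv_neg = sum(-cf / (1 + finance_rate) ** t
                 for t, cf in enumerate(flows) if cf < 0)
    return (fv_pos / pv_neg) ** (1 / n) - 1
```

Because the reinvestment rate is set explicitly (here an assumed 8% benchmark), MIRRs are less sensitive to distribution timing than IRRs and easier to compare across deals.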
Accounting data from public registries
An alternative to GP- and LP-reported data is to reconstruct performance from company accounts filed in public registries. In Europe and many other regions, private firms are required to file financial statements, which are aggregated in databases such as Orbis and FAME.
In principle, these data allow researchers to compute leverage, growth, profitability, and even equity returns independently. In practice, the paper shows that using these data is extremely challenging.
Key variables are often reported in footnotes rather than main statements. Classification is inconsistent. Distinguishing shareholder loans from third-party debt is difficult. Identifying the correct top-level entity is non-trivial, especially as offshore structures have become more common.
Accounting data also do not distinguish organic growth from acquisition-driven growth. Addressing these issues requires extensive manual work and institutional knowledge.
Conclusion
The paper does not argue that any dataset is superior in all dimensions. Instead, it documents the strengths and weaknesses of each source and emphasizes that data choice should be driven by the research question.
Private equity data has improved substantially over time, but important limitations remain. Transparency about data construction, careful validation, and modesty in interpretation remain essential.
The full paper is available on SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6013335


Really strong breakdown of the data quality problem in PE research. The point about cash flow availability shrinking from 100k funds to just 5k usable samples is something I've run into before when trying to backtest strategies. It's kind of wild that even matched funds across Preqin and PitchBook show different performance metrics for the same fund. The IRR discrepancies probably matter less than people think, but the selection bias around which funds actually report is still the elephant in the room.