Office of Management and Budget

MARCH 25, 2003

Mr. Chairman and Members of the Subcommittee,

Thank you for the opportunity to appear before the Subcommittee to discuss the Administration's views on data mining.

This committee has defined "data mining" as a "technology that facilitates the ability to sort through masses of information through database exploration, extract specific information in accordance with defined criteria, and then identify patterns of interest to its user." While there are many definitions of "data mining," the Committee's definition is generally accepted and helpful in defining the issue and its challenges. Additionally, data warehouses are being used as the source of data for many data mining applications. A data warehouse is a managed data repository of integrated, cleansed data whose source is mainly transactional data. Data is aggregated from various sources and structured for use in analysis and reporting.
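The definitions above can be illustrated with a minimal sketch. It assumes two hypothetical source systems (`store_sales` and `web_sales`) with made-up records; the function names and the cleansing rules are illustrative assumptions, not a description of any agency or vendor system. Records are integrated and cleansed into a single repository, specific information is extracted according to a defined criterion, and a simple pattern (item frequency) is reported.

```python
from collections import Counter

# Hypothetical transactional records from two source systems (illustrative data only).
store_sales = [
    {"customer": " Alice ", "item": "printer", "amount": "120.00"},
    {"customer": "bob", "item": "paper", "amount": "15.50"},
]
web_sales = [
    {"customer": "ALICE", "item": "paper", "amount": "12.00"},
    {"customer": "Bob", "item": "paper", "amount": "15.50"},
    {"customer": "carol", "item": "toner", "amount": ""},  # incomplete record
]

def cleanse(record, source):
    """Normalize one raw record; return None if the amount is missing."""
    if not record["amount"]:
        return None
    return {
        "customer": record["customer"].strip().lower(),
        "item": record["item"],
        "amount": float(record["amount"]),
        "source": source,
    }

def build_warehouse(sources):
    """Integrate and cleanse records from every named source into one list."""
    warehouse = []
    for name, records in sources.items():
        for raw in records:
            clean = cleanse(raw, name)
            if clean is not None:
                warehouse.append(clean)
    return warehouse

def extract(warehouse, min_amount):
    """Extract records that meet a defined criterion (a minimum amount)."""
    return [r for r in warehouse if r["amount"] >= min_amount]

def item_counts(records):
    """Identify a simple pattern: how often each item appears."""
    return Counter(r["item"] for r in records)

warehouse = build_warehouse({"store": store_sales, "web": web_sales})
selected = extract(warehouse, min_amount=10.0)
patterns = item_counts(selected)
print(patterns.most_common())
```

Real warehouses add much more (record linkage, history, access controls), but the three steps of integrating, extracting by criteria, and summarizing patterns are the same ones the definition names.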

Commercial Types and Uses of Data Mining

The private sector uses data mining to make sense of the wide breadth of data that companies and industries have available. Some examples of these uses:

  • Customer Relationship Management/Segmentation Analysis – Applied to customer relationship management (CRM), data mining is used to analyze disparate customer data and provide insight into customer needs and wants. It is used to analyze and segment customer buying patterns and to identify goods and services that are in demand. Companies that use data mining shorten their response time to market changes, which allows them to align their products more closely with their customers’ needs. They do this to increase revenue and to allocate investment to products that meet consumer demand effectively.

  • Fraud Detection – Companies use software that provides comprehensive, transaction-level financial reporting and analysis to support automatic fraud detection and proactive alerting. Software packages can also be used to detect anomalies, variances, and patterns in databases. For example, BlueCross/BlueShield and other health care payers use data mining tools to catch and prevent fraudulent and abusive billing practices. BlueCross/BlueShield’s solution can quickly search through millions of medical claims and detect inappropriate billing practices with a high degree of reliability.

  • Retail Analysis and Supply Chain Analysis – Companies such as Wal-Mart are broadly recognized for analyzing sales trends. Retail analysis and supply chain analysis can be used to predict the effectiveness of promotions, decide which products to stock in each store, and help managers understand cost and revenue trends in order to adjust pricing and promotions in anticipation of changes in marketplace conditions. Data mining also allows supply chain tools to monitor and analyze inventory trends, forecast product demand for replenishment, track vendor performance and identify problems, analyze distribution network efficiency, and understand supply chain costs and inefficiencies.

  • Medical Analysis/Diagnostics – The health care industry uses analysis to predict the effectiveness of surgical procedures, medical tests, and medications. High-risk segments of the population can be identified and targeted for proactive treatment. For example, American Healthways relies on predictive modeling to identify patient types who trend toward high-risk conditions, giving care coordinators a proactive approach to healing. The result is improved quality of life for the patients and reduced stress on hospitals and insurance providers.

  • Document Analysis (Text Mining) – Documents can be searched for information and insights in a fraction of the time an individual would spend locating one document. Document analysis involves analysis of text and of structured and unstructured data, organized by categories, to determine trends, patterns, and relationships. This can be highly effective in survey analysis. Content management systems and software packages perform analyses on an organization’s information products to help companies control information flows and work products. For example, Autonomy at BAE Systems aggregates content from many sources in many different formats, structured or unstructured, including their intranet and 10,000 news feeds per day. The goal is to personalize the delivery of that information to each user, and to eliminate work duplication and time-consuming searches. Autonomy automatically alerts BAE Systems employees to documents in the system that relate to what they're doing, or to other employees in the company whose interests and expertise match their own.

  • Use of Decision Support Systems (DSS) – Decision Support Systems may use data mining to identify trends and present the information in intuitively useful ways -- supporting more informed and effective decisions for business and organizational activities. For example, one DSS solution for HR management is now providing essential insights into The Bank of Scotland Group's HR activities worldwide, giving managers personnel and staffing information needed to make hiring and placement decisions. Managers can determine if job turnover in a particular area or occupation classification is higher than expected and investigate influences on loyalty such as the physical working environment.

  • Financial Analysis – The insurance industry uses data mining algorithms to conduct risk analysis, such as evaluating actuarial experience studies for mortality, withdrawal, and disability, and dynamically calculating exposures and expectations for period ranges. For example, Canada Life performs timely and accurate actuarial studies using a data warehouse and advanced data analysis methods; the Generali Group uses data mining tools to manage financial market risk and customer credit risk via a common analytical framework for rapid and flexible analysis and reporting of risk exposure.
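As one illustration of the fraud detection use described in the list above, the sketch below flags billing claims whose amounts are far from what is typical for the same procedure code. It is a minimal example built on assumptions: the claim records, the field names, and the threshold of 3.5 are hypothetical, and the technique shown (a robust z-score based on the median absolute deviation) is a common textbook approach, not the method used by any company named in this testimony.

```python
from statistics import median

def flag_unusual_claims(claims, threshold=3.5):
    """Return IDs of claims whose amount is far from the median for their code."""
    # Group billed amounts by procedure code.
    by_code = {}
    for claim in claims:
        by_code.setdefault(claim["code"], []).append(claim["amount"])

    flagged = []
    for claim in claims:
        amounts = by_code[claim["code"]]
        mid = median(amounts)
        # Median absolute deviation: a spread measure that outliers barely affect.
        mad = median(abs(a - mid) for a in amounts)
        if mad == 0:
            continue  # all amounts (nearly) identical; nothing to compare against
        score = 0.6745 * abs(claim["amount"] - mid) / mad
        if score > threshold:
            flagged.append(claim["id"])
    return flagged

# Hypothetical claims (illustrative data only).
claims = [
    {"id": "c1", "code": "A", "amount": 100.0},
    {"id": "c2", "code": "A", "amount": 105.0},
    {"id": "c3", "code": "A", "amount": 98.0},
    {"id": "c4", "code": "A", "amount": 102.0},
    {"id": "c5", "code": "A", "amount": 101.0},
    {"id": "c6", "code": "A", "amount": 950.0},
    {"id": "c7", "code": "B", "amount": 40.0},
    {"id": "c8", "code": "B", "amount": 42.0},
    {"id": "c9", "code": "B", "amount": 41.0},
]

suspicious = flag_unusual_claims(claims)
print(suspicious)
```

A flagged claim is only a candidate for human review; an unusual amount is not by itself evidence of fraud.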

Government Applications of Data Mining

The Federal government analyzes data that has been collected from the public for several purposes, including determining the eligibility of applicants for Federal benefits, detecting potential instances of fraud, waste, and abuse in Federal programs, and for law enforcement activities. Some of this analysis is facilitated by data mining. Here are a few examples of agency uses of data analysis techniques and software:

  • Financial management – Poor management practices have created opportunities for a wide range of fraud and abuse in the use of government travel and purchase cards. Several agency inspector general (IG) investigations have used statistical sampling processes to document inappropriate purchases and misuse of these cards. OMB is taking and will continue to take substantive, affirmative steps to ensure agencies improve their internal control systems to monitor expenditures properly.

  • Human Resources Management – One of the 24 E-government initiatives, the Enterprise HR Integration under the Office of Personnel Management, is leading the effort to provide a government-wide data warehouse of HR information to minimize the workload as employees move from one department to another. A key component of this is the E-Clearance project – OPM and its partner agencies on the E-Clearance project are using data mining to access information more quickly, which speeds up the overall security clearance investigation process. Given the backlog in clearances, this use of data mining is critical to our ability to get staff more effectively and rapidly through the human resources management processes.

  • Reducing Erroneous Payments and Fraud Detection – Data analysis accomplished via the matching of electronic databases between government agencies has been an important and successful tool for identifying improper payments under federal benefit and loan programs, as well as detecting potential instances of fraud, waste, and abuse in Federal programs. As highlighted in the FY 2004 President’s Budget, agencies are now required to report the extent of erroneous payments made in their major benefit programs. In addition, the last decade has shown an increased reliance and increased spending on non-discretionary social services, such as Medicare and Medicaid. These expenditures -- and therefore the potential for improper payments -- are likely to increase unless appropriate steps are taken to protect against errors and fraud. Through the President's Management Agenda initiative for improving financial performance, we are getting a handle on the problem of erroneous payments. For example, Medicare's erroneous payment rate has fallen from 6.8 percent to 6.3 percent and the Food Stamp program reduced its national error rate from 8.9 percent to 8.7 percent. Just these small rate reductions prevented the waste of almost $1 billion. Furthermore, the Administration has proposed several pieces of legislation regarding the Administration’s authority to share data that will greatly improve efforts to reduce erroneous payments.

  • Policy Analysis – The quality of policy decisions is a function of our ability to correctly analyze enormous amounts of data that describe a problem faced by modern society. For example, the Department of Education mines data from a variety of its student financial aid systems, including the Central Processing System, Pell Grant Payment System and National Student Loan Data System, permitting professionals to analyze Federal education programs quickly and easily, without the time, expense, and burden on citizens of paper-driven surveys.

  • Law enforcement and Homeland Security – Federal agencies have found data mining techniques to be an important tool for assisting law enforcement in combating terrorism. For example, systems such as the Automated Commercial Environment (ACE), operated by the Department of Homeland Security’s Bureau of Customs and Border Protection, can utilize a series of data mining tools to strengthen border security efforts. ACE will provide the IT mechanisms for making quick evaluations on whether particular people or goods should be deemed high-risk or low-risk. Also, ACE will enable the Department of Homeland Security and other Federal agencies to more precisely target for inspection or investigation the highest risk people and cargo crossing the border. Through tools such as ACE, agencies have the ability to instantaneously analyze vast amounts of data and intelligence to see links among businesses and people, thus revealing security threats that might otherwise have gone unnoticed.

  • Citizen access to government data – Search sites such as the one available at the FirstGov website provide a facility for searching vast amounts of unstructured data across the Federal government by using publicly available search engines. In addition, the Federal government conducts its own data analyses for statistical purposes and facilitates data user access to statistical data. For example, the Census Bureau's “American FactFinder System (Advanced Query)” uses a data mining tool to allow users to query Census 2000 detailed data files. The tool provides simplified access to and extraction of data.
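The database matching described above under erroneous payments can be sketched in a few lines. This is a minimal illustration under stated assumptions: the two files, the field names (`payment_id`, `recipient_id`, `amount`), and all records are hypothetical, and the logic is a simple key lookup rather than the procedure of any actual Federal matching program.

```python
def match_payments(payments, eligibility):
    """Flag payments whose recipient is missing from, or ineligible in, the other file."""
    flagged = []
    for p in payments:
        status = eligibility.get(p["recipient_id"])
        if status is None:
            flagged.append((p["payment_id"], "no eligibility record"))
        elif not status:
            flagged.append((p["payment_id"], "recipient marked ineligible"))
    return flagged

def flagged_rate(payments, flagged):
    """Share of total dollars paid that falls on flagged payments."""
    total = sum(p["amount"] for p in payments)
    ids = {payment_id for payment_id, _ in flagged}
    flagged_total = sum(p["amount"] for p in payments if p["payment_id"] in ids)
    return flagged_total / total if total else 0.0

# Hypothetical payment file from one agency (illustrative data only).
payments = [
    {"payment_id": "P1", "recipient_id": "R1", "amount": 500.0},
    {"payment_id": "P2", "recipient_id": "R2", "amount": 300.0},
    {"payment_id": "P3", "recipient_id": "R3", "amount": 200.0},
    {"payment_id": "P4", "recipient_id": "R9", "amount": 1000.0},
]
# Hypothetical eligibility file from another agency: recipient ID -> eligible?
eligibility = {"R1": True, "R2": False, "R3": True}

flagged = match_payments(payments, eligibility)
rate = flagged_rate(payments, flagged)
print(flagged, rate)
```

The output of a match is a list of payments to review, not a finding that they were improper; a missing record may reflect a data error in either file.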

Benefits and Pitfalls

As outlined above, the government has found a number of ways to use collected information to improve program effectiveness and to reduce misuse of taxpayer dollars. While the use of data mining techniques to access useful, timely data and to identify relationships that were previously unknown is a powerful tool for identifying errors, fraud, threats, etc., the application of such techniques to personal information raises serious questions about privacy and how it should be protected. To realize these benefits while protecting privacy, the government must continue to act in several areas:

1. Federal data analyses must be consistent with law

In the federal arena, data mining activities must be implemented consistent with the protections of the Privacy Act of 1974, as amended by the Computer Matching and Privacy Protection Act of 1988, and other privacy statutes. These statutes do not address data mining per se, but they outline privacy principles the government must follow in data collection, including: notice and reasonable disclosure; use and purpose limitations; choice; access to government-held information; information security; redress; and oversight. Agencies are well-versed in the legal, policy, and technical requirements governing access to and sharing of personal data. Agencies may aggregate information by analyzing data across databases, a concept known as “virtual data warehousing”; however, when information can be accessed or exchanged at numerous locations by many users, a potential exists for inadvertent disclosure of personal information or misuse of personal information, by alteration or for unauthorized purposes. Agencies that adhere to the existing legal and policy structure, including OMB and NIST policy guidance, can protect personal information in their possession even as they participate in data mining activities. Furthermore, the E-Government Act of 2002 requires that an agency conduct a Privacy Impact Assessment (PIA) when agencies develop or procure information technology to initiate a new online collection of information that involves personally identifiable information changing hands, such as in the case of matching.

2. Ensuring the Security of Federal IT Systems

The Federal Information Security Management Act (FISMA) provides a comprehensive framework for ensuring the effectiveness of information security controls over federal information resources, including resources that result from data mining. FISMA requires the head of each agency to periodically assess the risk and magnitude of harm that could result from unauthorized access, use, modification, or disclosure of information. The agency must then provide information security protections that are commensurate with the stated risk. Agencies are required to periodically test their information security controls and techniques to ensure that they are effectively implemented. The results of this testing are reported to OMB on an annual basis.


“Data mining” can have many uses. The Administration is strongly committed to using available technologies like data mining to serve citizens and protect them from threats, and it is equally committed to protecting the privacy of citizens when such tools are used. Through data analysis and data mining, the private sector has improved customer service, better met customer needs, and helped customers take proactive approaches to health care. The federal government has reduced the number of erroneous payments, and has been able to determine patterns in databases that help predict both weather patterns and the spread of deadly viruses.

We need to use modern analytic tools, such as data mining, to improve government performance, from policy analysis to fraud detection to homeland security. We can maintain privacy and security while improving government productivity, but we must employ tools like data mining appropriately. We hope to work with this Committee to ensure that the benefits of data analysis continue to help Federal agencies perform their missions, while protecting against the problems that aggressive and abusive data mining can cause.