eDiscovery Terms Explained

Electronic discovery (eDiscovery or e-discovery) is critical to a variety of legal processes and compliance with regulatory mandates. To work together effectively, IT pros and legal teams need a common understanding of e-discovery terms, including both legal nomenclature and related technical terms.

Here are definitions of the top eDiscovery terms:

Admissible: Relevant and reliable evidence that can be used in court.
Analytics: Technologies used to analyze raw data in order to draw conclusions from it.
Archive: A long-term repository for the storage of records and files.
Assisted review: Use of computer technologies to identify and tag potentially responsive documents based on keywords and other metadata.
Attachment: An electronic file sent with an email message.
Backup: Both the action of and the result of creating a copy of data to be stored separately from the computer system as a precaution against loss or damage of the original data.
Backup tape: Tape media used to store copies of data created as a precaution against the loss or damage of the original data.
Batch processing: The process of gathering a large amount of ESI (electronically stored information) in a single step, as opposed to using individual processes in sequence.
Big data: A collection of structured and unstructured datasets that are high in volume, velocity and variety. Big data can also refer to very large collections of ESI.
Boolean search: A search technique based on the system of logic developed by mathematician George Boole. Boolean searches use operators (such as AND, OR and NOT) combined with keywords to refine the results returned. For example, the AND operator between two words returns only documents that contain both of the words, while the OR operator between two words returns documents containing either of the target words.
Chain of custody: Tracking of the movement, handling and location of electronic evidence chronologically from collection to production. Chain of custody helps to show that the evidence being offered has not been tampered with and is authentic.
Child document: A file that is attached to another file, such as an email attachment, a spreadsheet embedded in a word processing document, or each zipped document in a zip file. See also parent document.
Coding: Filling out a form for each document with case-relevant information (author, date authored, date sent, recipient, date opened, etc.). Proper coding can tie all documents together with consistent identifiers to avoid confusion and make research more productive.
Computer forensics: Computer investigation and analysis techniques for capturing data and recovering deleted, encrypted, or damaged file information.
Concept search (thesaurus, synonym or related search): A search technique that returns results using not only the query word provided but also words related to it. Concept searches can be implemented using a simple thesaurus match or sophisticated statistical analysis methods.
Container file: An application or object that contains multiple other files or objects, which can be represented as files, such as an archive or a compound document with embedded or linked objects. Common container file formats are zip, rar and pst.
Culling: Using defined criteria (dates, keywords, custodians, etc.) to reduce a set of data to the most relevant electronic documents.
Custodian (data custodian): A person responsible for the aggregation, storage and use of data sets while protecting the data as defined by the organization’s security policy or its standard IT practices.
Data extraction: The process of parsing data from electronic documents to identify their metadata and body content, or the process of retrieving potentially relevant ESI and metadata from its native source to another repository.
Data mapping: The process of identifying and recording the location and types of ESI within an organization’s network. Data mapping finds or suggests associations between files that might not be apparent using other techniques.
Deduplication (de-duplication or de-duping): The process of identifying and removing copies of documents in a document collection. There are three types of deduplication: case, custodian and production.
DeNisting (deNISTing or de-nisting): Removing operating system files and other non-user created data from a result set. NIST (National Institute of Standards and Technology) provides a list of more than 40 million known files that are usually irrelevant to cases but often make up a sizable portion of a collected set of ESI.
Discovery: The process of identifying, acquiring and reviewing information that is potential evidence in a legal matter. In the U.S., also means the pre-trial process of providing documents to an opposing party.
Document family: A group of related documents, such as an email and its attachments.
Early case assessment (ECA): The process of identifying and gathering potential evidence early in a legal matter in order to estimate the risks and costs of pursuing a particular legal course of action.
Electronic discovery (ED, digital discovery, electronic digital discovery, electronic document discovery, EDD or electronic evidence discovery): The process of finding, identifying, locating, reviewing and producing relevant ESI for litigation purposes.
Electronic evidence: Information that is stored in an electronic format and used to prove or disprove claims in a legal matter.
Electronically stored information (ESI): Data found in hard drives, online social networks, PDAs, smartphones, voicemail, text messaging applications and other electronic data stores. Under the Federal Rules of Civil Procedure (FRCP), ESI is information created, manipulated, communicated, stored and best utilized in digital form, requiring the use of computer hardware and software.
Email: Electronic messages sent or received using an application such as Microsoft Outlook or Google Gmail.
Filtering: Reducing a data set by removing documents that do not fit specified parameters, such as a date range or data type.
Forensics: See Computer forensics.
FRCP (Federal Rules of Civil Procedure): Rules that apply in most civil actions heard in U.S. District Courts, which include rules governing electronic discovery and the treatment of ESI.
Harvesting: Gathering electronic data for use in an investigation or lawsuit, preferably while maintaining file and system metadata.
Hash: An algorithm that computes a fixed-length value from a file's contents, used to verify that electronic documents have not changed. A hash value serves as a digital fingerprint.
Hosting: In eDiscovery, a service provided by a third-party litigation support firm that provides access to documents relating to a particular matter within a review software platform.
Image (drive) (mirror image or mirroring): An identical copy of a hard drive, including its empty space.
Image (file): A picture copy of a document. The most common formats in eDiscovery are TIFF and PDF.
Keyword search: The process of looking for documents that contain a specified string. While standard keyword searches will match only on the exact string specified, most litigation support search engines use stemming to return additional results.
Legacy data: Information stored on software or hardware that is outmoded or obsolete, or data whose format has become obsolete.
Litigation hold (legal hold, hold order, preservation order, suspension order, freeze notice, hold notice, stop destruction notice): A notice or communication from legal counsel to an organization to suspend the normal processing of records, such as recycling of backup tapes, to avoid spoliation of evidence.
Load file: A file used to import images of documents into an electronic discovery platform, along with corresponding text and metadata files required for the documents to remain searchable. Load files are now often unnecessary because many document review platforms can ingest documents in their native format.
Media: The device on which electronic information is stored, e.g., hard drives and backup tapes.
Metadata: Data embedded in electronic files that provides information about them, such as how, when and by whom they were created, edited and processed, or the types of data they contain.
Mirror image: See Image (drive).
Native format: The format in which an electronic file was originally created. A native file format preserves metadata and other details that can be lost when documents are converted to other formats, such as when using a load file.
Near-duplicate: A document that is highly similar to another document. Near-duplicates are identified during the data reduction process to reduce the time and costs associated with review.
Normalization: The process of reformatting data into a standard format.
Optical character recognition (OCR): The process of converting a scanned document into searchable text.
Parent document: A primary document to which other documents and files in a set are attached. See also child document.
Precision: A measure of how many of the documents returned by a query are actually responsive. Low precision indicates that many irrelevant documents were returned. High precision indicates that most documents returned were responsive, but does not guarantee that all relevant documents were found. See also Recall.
Predictive coding: A process for reducing the number of non-responsive documents. Often uses a combination of machine learning, keyword search, filtering, and sampling.
Privilege: A special legal advantage or right. For example, certain communications between an individual and their attorney are protected from disclosure.
Processing: The ingestion and further handling of data. Often includes the extraction of files from pst and zip archives, separation of attachments, conversion of files to formats the review tool can read, extraction of text and metadata, and data normalization.
Production: The delivery of ESI that meets the criteria of the discovery request in appropriate forms and using appropriate delivery mechanisms to the opposing counsel or requesting party.
PST: A file format used to store messages, calendar events and other items in Microsoft products such as Outlook and Windows Messaging. Also commonly used to refer to those files themselves (“PSTs”).
Recall: A measure of how well a query identifies responsive documents. A recall score of 100% indicates that the query has returned all responsive documents in the collection. A low recall score indicates that responsive documents were improperly excluded as non-responsive. See also Precision.
Records management: The systematic supervision and administration of digital or paper documents that are important enough to an organization to be worth ongoing maintenance, such as documents that provide evidence, have historical value or deliver other business benefits.
Redaction: The process of removing privileged, proprietary, or confidential information from a document by placing a black area over that information.
Responsiveness: A measure of whether a document is relevant to the request.
Search: The process of finding terms within data sets using specific criteria (a query). Searching can be performed by simple keyword or by concept searches that identify documents related to the query even when the query term is not present in the document.
Slack space: The unused space that exists when the data does not completely fill the space allotted for it. Slack space can contain information from prior records stored at the same physical location as current records, metadata fragments and other information useful for forensic analysis of computer systems.
Social discovery: The discovery of ESI on social media platforms like Facebook, Twitter, YouTube, LinkedIn and Instagram.
Spoliation: The alteration, deletion or partial destruction of data that might be relevant to ongoing or anticipated litigation, government investigation or audit. Failure to preserve information that may become evidence is also spoliation.
Stemming: A keyword search technique that returns not just matches on the string specified but also grammatical variations on that string. For example, with stemming, a search for the keyword “related” would also return documents that contain “relating”, “relates”, or “relate”.
Structured data: Data that resides in relational databases, which are structured to recognize relationships between pieces of data. See also Unstructured data.
System files: Electronic files that are part of the operating system or another control program and are created by the computer. Well-known system files on a Windows computer include msdos.sys, io.sys, ntdetect.com and ntldr.
Tagging: The process of assigning classification tags to documents.
Thread (email string or chain): An initiating email and all replies and forwards.
TIFF (Tagged Image File Format): A common graphic file format. The file extension related to this format is .tif or .tiff. Scanned documents are often stored as TIFF images.
Unallocated space: Space on a hard drive where new data can be stored. When a specific file is marked for deletion, its space is marked as unallocated, but until the data is overwritten, it can still be retrieved.
Unicode: A standard that provides uniform digital representations of the character sets of all of the world’s languages. Unicode provides a uniform means for storing and searching for text in any language.
Unitization: The process by which an image is analyzed and broken down according to logical boundaries into multiple child documents.
Unstructured data: Information that is not organized and labelled to identify meaningful relationships between components. Examples include text files, server and application logs, images, audio and video files, and emails. See also Structured data.
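
For IT pros, a few of the terms above (Hash, Deduplication, Precision and Recall) are easier to grasp as small computations. The following is a minimal Python sketch, not a description of how any particular eDiscovery platform works. It assumes SHA-256 as the hash algorithm (the glossary does not name one), treats deduplication as removing byte-for-byte identical copies only, and uses hypothetical file names and contents throughout.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Return the SHA-256 hex digest of a document's bytes (its "digital fingerprint")."""
    # Assumption: SHA-256 is chosen for illustration; other algorithms are also used in practice.
    return hashlib.sha256(content).hexdigest()


def deduplicate(documents: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first document seen for each distinct hash value and drop exact copies."""
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for name, content in documents.items():
        digest = file_hash(content)
        if digest not in seen:
            seen.add(digest)
            unique[name] = content
    return unique


def precision_recall(retrieved: set[str], responsive: set[str]) -> tuple[float, float]:
    """Compute precision and recall of a query result against the known responsive set."""
    true_positives = len(retrieved & responsive)
    # Precision: share of returned documents that are responsive.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of responsive documents that the query returned.
    recall = true_positives / len(responsive) if responsive else 0.0
    return precision, recall


# Hypothetical collection: memo_copy.txt is a byte-for-byte copy of memo.txt.
collection = {
    "memo.txt": b"Q3 pricing memo",
    "memo_copy.txt": b"Q3 pricing memo",
    "lunch.txt": b"Lunch menu",
    "contract.txt": b"Pricing contract draft",
}
unique_docs = deduplicate(collection)
print(sorted(unique_docs))  # ['contract.txt', 'lunch.txt', 'memo.txt']

# Hypothetical query result compared with the documents reviewers judged responsive.
retrieved = {"memo.txt", "lunch.txt"}
responsive = {"memo.txt", "contract.txt"}
precision, recall = precision_recall(retrieved, responsive)
print(precision, recall)  # 0.5 0.5
```

In this toy example the query returned one responsive document out of two returned (precision 0.5) and found one of the two responsive documents in the collection (recall 0.5). Because the comparison is on exact hash values, a near-duplicate that differs by even one byte would not be removed by this sketch.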

Ryan Brooks

Product Evangelist at Netwrix Corporation, writer, and presenter. Ryan specializes in evangelizing cybersecurity and promoting the importance of visibility into IT changes and data access. As an author, Ryan focuses on IT security trends, surveys, and industry insights.