guix/guix/store/deduplication.scm

304 lines
13 KiB
Scheme
Raw Normal View History

;;; GNU Guix --- Functional package management for GNU
;;; Copyright © 2017 Caleb Ristvedt <caleb.ristvedt@cune.org>
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
;;; Copyright © 2018-2021 Ludovic Courtès <ludo@gnu.org>
;;;
;;; This file is part of GNU Guix.
;;;
;;; GNU Guix is free software; you can redistribute it and/or modify it
;;; under the terms of the GNU General Public License as published by
;;; the Free Software Foundation; either version 3 of the License, or (at
;;; your option) any later version.
;;;
;;; GNU Guix is distributed in the hope that it will be useful, but
;;; WITHOUT ANY WARRANTY; without even the implied warranty of
;;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
;;; GNU General Public License for more details.
;;;
;;; You should have received a copy of the GNU General Public License
;;; along with GNU Guix. If not, see <http://www.gnu.org/licenses/>.
;;; This houses stuff we do to files when they arrive at the store - resetting
;;; timestamps, deduplicating, etc.
(define-module (guix store deduplication)
Switch to Guile-Gcrypt. This removes (guix hash) and (guix pk-crypto), which now live as part of Guile-Gcrypt (version 0.1.0.) * guix/gcrypt.scm, guix/hash.scm, guix/pk-crypto.scm, tests/hash.scm, tests/pk-crypto.scm: Remove. * configure.ac: Test for Guile-Gcrypt. Remove LIBGCRYPT and LIBGCRYPT_LIBDIR assignments. * m4/guix.m4 (GUIX_ASSERT_LIBGCRYPT_USABLE): Remove. * README: Add Guile-Gcrypt to the dependencies; move libgcrypt as "required unless --disable-daemon". * doc/guix.texi (Requirements): Likewise. * gnu/packages/bash.scm, guix/derivations.scm, guix/docker.scm, guix/git.scm, guix/http-client.scm, guix/import/cpan.scm, guix/import/cran.scm, guix/import/crate.scm, guix/import/elpa.scm, guix/import/gnu.scm, guix/import/hackage.scm, guix/import/texlive.scm, guix/import/utils.scm, guix/nar.scm, guix/pki.scm, guix/scripts/archive.scm, guix/scripts/authenticate.scm, guix/scripts/download.scm, guix/scripts/hash.scm, guix/scripts/pack.scm, guix/scripts/publish.scm, guix/scripts/refresh.scm, guix/scripts/substitute.scm, guix/store.scm, guix/store/deduplication.scm, guix/tests.scm, tests/base32.scm, tests/builders.scm, tests/challenge.scm, tests/cpan.scm, tests/crate.scm, tests/derivations.scm, tests/gem.scm, tests/nar.scm, tests/opam.scm, tests/pki.scm, tests/publish.scm, tests/pypi.scm, tests/store-deduplication.scm, tests/store.scm, tests/substitute.scm: Adjust imports. * gnu/system/vm.scm: Likewise. (guile-sqlite3&co): Rename to... (gcrypt-sqlite3&co): ... this. Add GUILE-GCRYPT. (expression->derivation-in-linux-vm)[config]: Remove. (iso9660-image)[config]: Remove. (qemu-image)[config]: Remove. (system-docker-image)[config]: Remove. * guix/scripts/pack.scm: Adjust imports. (guile-sqlite3&co): Rename to... (gcrypt-sqlite3&co): ... this. Add GUILE-GCRYPT. (self-contained-tarball)[build]: Call 'make-config.scm' without #:libgcrypt argument. (squashfs-image)[libgcrypt]: Remove. [build]: Call 'make-config.scm' without #:libgcrypt. (docker-image)[config, json]: Remove. [build]: Add GUILE-GCRYPT to the extensions Remove (guix config) from the imported modules. * guix/self.scm (specification->package): Remove "libgcrypt", add "guile-gcrypt". (compiled-guix): Remove #:libgcrypt. [guile-gcrypt]: New variable. [dependencies]: Add it. [*core-modules*]: Remove #:libgcrypt from 'make-config.scm' call. Add #:extensions. [*config*]: Remove #:libgcrypt from 'make-config.scm' call. (%dependency-variables): Remove %libgcrypt. (make-config.scm): Remove #:libgcrypt. * build-aux/build-self.scm (guile-gcrypt): New variable. (make-config.scm): Remove #:libgcrypt. (build-program)[fake-gcrypt-hash]: New variable. Add (gcrypt hash) to the imported modules. Adjust load path assignments. * gnu/packages/package-management.scm (guix)[propagated-inputs]: Add GUILE-GCRYPT. [arguments]: In 'wrap-program' phase, add GUILE-GCRYPT to the search path.
2018-08-31 15:07:07 +00:00
#:use-module (gcrypt hash)
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
#:use-module ((guix build utils) #:hide (dump-port))
#:use-module (guix build syscalls)
#:use-module (guix base32)
#:use-module (srfi srfi-11)
#:use-module (srfi srfi-34)
#:use-module (srfi srfi-35)
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
#:use-module (rnrs bytevectors)
#:use-module (rnrs io ports)
#:use-module (ice-9 ftw)
#:use-module (ice-9 match)
#:use-module (guix serialization)
#:export (nar-sha256
deduplicate
dump-file/deduplicate
copy-file/deduplicate))
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
;; TODO: Remove once 'dump-port' in (guix build utils) has an optional 'len'
;; parameter.
(define* (dump-port in out
#:optional len
#:key (buffer-size 16384))
"Read LEN bytes from IN (or as much as possible if LEN is #f) and write it
to OUT, using chunks of BUFFER-SIZE bytes."
(define buffer
(make-bytevector buffer-size))
(let loop ((total 0)
(bytes (get-bytevector-n! in buffer 0
(if len
(min len buffer-size)
buffer-size))))
(or (eof-object? bytes)
(and len (= total len))
(let ((total (+ total bytes)))
(put-bytevector out buffer 0 bytes)
(loop total
(get-bytevector-n! in buffer 0
(if len
(min (- len total) buffer-size)
buffer-size)))))))
(define (nar-sha256 file)
"Gives the sha256 hash of a file and the size of the file in nar form."
(let-values (((port get-hash) (open-sha256-port)))
(write-file file port)
(force-output port)
(let ((hash (get-hash))
(size (port-position port)))
(close-port port)
(values hash size))))
(define (tempname-in directory)
"Gives an unused temporary name under DIRECTORY. Not guaranteed to still be
unused by the time you create anything with that name, but a good shot."
(let ((const-part (string-append directory "/.tmp-link-"
(number->string (getpid)))))
(let try ((guess-part
(number->string (random most-positive-fixnum) 16)))
(if (file-exists? (string-append const-part "-" guess-part))
(try (number->string (random most-positive-fixnum) 16))
(string-append const-part "-" guess-part)))))
(define* (get-temp-link target #:optional (link-prefix (dirname target)))
"Like mkstemp!, but instead of creating a new file and giving you the name,
it creates a new hardlink to TARGET and gives you the name. Since
cross-file-system hardlinks don't work, the temp link must be created on the
same file system - where in that file system it is can be controlled by
LINK-PREFIX."
(let try ((tempname (tempname-in link-prefix)))
(catch 'system-error
(lambda ()
(link target tempname)
tempname)
(lambda args
(if (= (system-error-errno args) EEXIST)
(try (tempname-in link-prefix))
(apply throw args))))))
(define (call-with-writable-file file store thunk)
(if (string=? file store)
(thunk) ;don't meddle with the store's permissions
(let ((stat (lstat file)))
(dynamic-wind
(lambda ()
(make-file-writable file))
thunk
(lambda ()
(set-file-time file stat)
(chmod file (stat:mode stat)))))))
(define-syntax-rule (with-writable-file file store exp ...)
"Make FILE writable for the dynamic extent of EXP..., except if FILE is the
store."
(call-with-writable-file file store (lambda () exp ...)))
;; There are 3 main kinds of errors we can get from hardlinking: "Too many
;; things link to this" (EMLINK), "this link already exists" (EEXIST), and
;; "can't fit more stuff in this directory" (ENOSPC).
(define* (replace-with-link target to-replace
#:key (swap-directory (dirname target))
(store (%store-directory)))
"Atomically replace the file TO-REPLACE with a link to TARGET. Use
SWAP-DIRECTORY as the directory to store temporary hard links. Upon ENOSPC
and EMLINK, TO-REPLACE is left unchanged.
Note: TARGET, TO-REPLACE, and SWAP-DIRECTORY must be on the same file system."
(define temp-link
(catch 'system-error
(lambda ()
(get-temp-link target swap-directory))
(lambda args
;; We get ENOSPC when we can't fit an additional entry in
;; SWAP-DIRECTORY. If it's EMLINK, then TARGET has reached its
;; maximum number of links.
(if (memv (system-error-errno args) `(,ENOSPC ,EMLINK))
#f
(apply throw args)))))
;; If we couldn't create TEMP-LINK, that's OK: just don't do the
;; replacement, which means TO-REPLACE won't be deduplicated.
(when temp-link
(with-writable-file (dirname to-replace) store
(catch 'system-error
(lambda ()
(rename-file temp-link to-replace))
(lambda args
(delete-file temp-link)
(unless (= EMLINK (system-error-errno args))
(apply throw args)))))))
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
(define %deduplication-minimum-size
;; Size below which files are not deduplicated. This avoids adding too many
;; entries to '.links', which would slow down 'removeUnusedLinks' while
;; saving little space. Keep in sync with optimize-store.cc.
8192)
(define* (deduplicate path hash #:key (store (%store-directory)))
"Check if a store item with sha256 hash HASH already exists. If so,
replace PATH with a hardlink to the already-existing one. If not, register
PATH so that future duplicates can hardlink to it. PATH is assumed to be
under STORE."
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
;; Lightweight promises.
(define-syntax-rule (delay exp)
(let ((value #f))
(lambda ()
(unless value
(set! value exp))
value)))
(define-syntax-rule (force promise)
(promise))
(define links-directory
(string-append store "/.links"))
(let loop ((path path)
(type (stat:type (lstat path)))
(hash hash))
(if (eq? 'directory type)
;; Can't hardlink directories, so hardlink their atoms.
(for-each (match-lambda
((file . properties)
(unless (member file '("." ".."))
(let* ((file (string-append path "/" file))
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
(st (delay (lstat file)))
(type (match (assoc-ref properties 'type)
((or 'unknown #f)
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
(stat:type (force st)))
(type type))))
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
(when (or (eq? 'directory type)
(and (eq? 'regular type)
(>= (stat:size (force st))
%deduplication-minimum-size)))
(loop file type
(and (not (eq? 'directory type))
(nar-sha256 file))))))))
(scandir* path))
(let ((link-file (string-append links-directory "/"
(bytevector->nix-base32-string hash))))
(if (file-exists? link-file)
(replace-with-link link-file path
#:swap-directory links-directory
#:store store)
(catch 'system-error
(lambda ()
(link path link-file))
(lambda args
(let ((errno (system-error-errno args)))
(cond ((= errno EEXIST)
;; Someone else put an entry for PATH in
;; LINKS-DIRECTORY before we could. Let's use it.
(replace-with-link path link-file
#:swap-directory
links-directory
#:store store))
((= errno ENOENT)
;; This most likely means that LINKS-DIRECTORY does
;; not exist. Attempt to create it and try again.
(mkdir-p links-directory)
(loop path type hash))
((= errno ENOSPC)
;; There's not enough room in the directory index for
;; more entries in .links, but that's fine: we can
;; just stop.
#f)
((= errno EMLINK)
;; PATH has reached the maximum number of links, but
;; that's OK: we just can't deduplicate it more.
#f)
(else (apply throw args)))))))))))
(define (tee input len output)
"Return a port that reads up to LEN bytes from INPUT and writes them to
OUTPUT as it goes."
(define bytes-read 0)
(define (fail)
;; Reached EOF before we had read LEN bytes from INPUT.
(raise (condition
(&nar-error (port input)
(file (port-filename output))))))
(define (read! bv start count)
;; Read at most LEN bytes in total.
(let ((count (min count (- len bytes-read))))
(let loop ((ret (get-bytevector-n! input bv start count)))
(cond ((eof-object? ret)
(if (= bytes-read len)
0 ; EOF
(fail)))
((and (zero? ret) (> count 0))
;; Do not return zero since zero means EOF, so try again.
(loop (get-bytevector-n! input bv start count)))
(else
(put-bytevector output bv start ret)
(set! bytes-read (+ bytes-read ret))
ret)))))
(make-custom-binary-input-port "tee input port" read! #f #f #f))
(define* (dump-file/deduplicate file input size type
#:key (store (%store-directory)))
"Write SIZE bytes read from INPUT to FILE. TYPE is a symbol, either
'regular or 'executable.
This procedure is suitable as a #:dump-file argument to 'restore-file'. When
used that way, it deduplicates files on the fly as they are restored, thereby
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
removing the need for a deduplication pass that would re-read all the files
down the road."
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
(define (dump-and-compute-hash)
(call-with-output-file file
(lambda (output)
(let-values (((hash-port get-hash)
(open-hash-port (hash-algorithm sha256))))
(write-file-tree file hash-port
#:file-type+size (lambda (_) (values type size))
#:file-port
(const (tee input size output)))
(close-port hash-port)
(get-hash)))))
daemon: Do not deduplicate files smaller than 8 KiB. Files smaller than 8 KiB typically represent ~70% of the entries in /gnu/store/.links but only contribute to ~4% of the space savings afforded by deduplication. Not considering these files for deduplication speeds up file insertion in the store and, more importantly, leaves 'removeUnusedLinks' with fewer entries to traverse, thereby speeding it up proportionally. Partly fixes <https://issues.guix.gnu.org/24937>. * config-daemon.ac: Remove symlink hard link check and CAN_LINK_SYMLINK definition. * guix/store/deduplication.scm (%deduplication-minimum-size): New variable. (deduplicate)[loop]: Do not recurse when FILE's size is below %DEDUPLICATION-MINIMUM-SIZE. (dump-port): New procedure. (dump-file/deduplicate)[hash]: Turn into... [dump-and-compute-hash]: ... this thunk. Call 'deduplicate' only when SIZE is greater than %DEDUPLICATION-MINIMUM-SIZE; otherwise call 'dump-port'. * nix/libstore/gc.cc (LocalStore::removeUnusedLinks): Drop files where st.st_size < deduplicationMinSize. * nix/libstore/local-store.hh (deduplicationMinSize): New declaration. * nix/libstore/optimise-store.cc (deduplicationMinSize): New variable. (LocalStore::optimisePath_): Return when PATH is a symlink or smaller than 'deduplicationMinSize'. * tests/derivations.scm ("identical files are deduplicated"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/nar.scm ("restore-file-set with directories (signed, valid)"): Likewise. * tests/store-deduplication.scm ("deduplicate, below %deduplication-minimum-size"): New test. ("deduplicate", "deduplicate, ENOSPC"): Produce files bigger than %DEDUPLICATION-MINIMUM-SIZE. * tests/store.scm ("substitute, deduplication"): Likewise.
2021-11-13 20:47:15 +00:00
(if (>= size %deduplication-minimum-size)
(deduplicate file (dump-and-compute-hash) #:store store)
(call-with-output-file file
(lambda (output)
(dump-port input output size)))))
(define* (copy-file/deduplicate source target
#:key (store (%store-directory)))
"Like 'copy-file', but additionally deduplicate TARGET in STORE."
(call-with-input-file source
(lambda (input)
(let ((stat (stat input)))
(dump-file/deduplicate target input (stat:size stat)
(if (zero? (logand (stat:mode stat)
#o100))
'regular
'executable)
#:store store)))))