
Curl Download Only New Files: Tips and Tricks for Saving Bandwidth and Time



I need to download a file from an HTTP server, but only if it has changed since the last time I downloaded it (e.g. via the If-Modified-Since header). I also need to use a custom name for the file on my disk.
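
A minimal sketch of one way to do this with curl; the URL is a placeholder and my.file is the custom local name. Given an existing file name, -z/--time-cond makes curl send If-Modified-Since based on that file's timestamp, -o sets the local name, and -R keeps the server's modification time on the saved file.

# On the first run, when my.file does not exist yet, curl simply downloads it
curl -z my.file -R -o my.file https://example.com/path/remote-file.txt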




Curl Download Only New Files




A similar approach to "date check" (with "curl --time-cond"), would be to download according to file size comparison, i.e. Download only if the local file has a different size than the remote file.
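
A rough sketch of such a size check in shell, assuming the server reports a Content-Length header; the URL and file name are placeholders.

url=https://example.com/data.bin
file=data.bin

# Fetch only the headers and extract the Content-Length value
remote_size=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')

# Size of the local copy (0 if it does not exist yet)
local_size=$([ -f "$file" ] && wc -c < "$file" | tr -d ' ' || echo 0)

# Download only when the sizes differ
if [ "$remote_size" != "$local_size" ]; then
  curl -o "$file" "$url"
fi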


Besides the display of a progress indicator (which I explain below), you don't have much indication of what curl actually downloaded. So let's confirm that a file named my.file was actually downloaded.
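
For instance:

ls -l my.file    # the file's size and timestamp confirm the download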


In the case of curl, the author apparently believes that it's important to tell the user the progress of the download. For a very small file, that status display is not terribly helpful. Let's try it with a bigger file (this is the baby names file from the Social Security Administration) to see how the progress indicator animates:
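
# The URL below is an assumption about where the SSA publishes this dataset;
# adjust it if the file has moved. -O keeps the remote file name.
curl -O https://www.ssa.gov/oact/babynames/names.zip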


But what if we wanted to send the contents of a web file to another program? Maybe to wc, which is used to count words and lines? Then we can use the powerful Unix feature of pipes. In this example, I'm using curl's silent option so that only the output of wc (and not the progress indicator) is seen. Also, I'm using the -l option for wc to just get the number of lines in the HTML for example.com:
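
# -s silences the progress meter so only wc's output appears
curl -s https://example.com | wc -l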


But not only is that less elegant, it also requires creating a new file called temp.file. Now, this is a trivial concern, but someday, you may work with systems and data flows in which temporarily saving a file is not an available luxury (think of massive files).


You can use curl's -C - option. This option is used to resume a broken download, but it will skip the download if the file is already complete. Note that the argument to -C is a single dash. A disadvantage might be that curl still briefly contacts the remote server to ask for the file size.
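
For example (the URL is a placeholder):

# Resume the download, or skip it entirely if the local copy is already complete
curl -C - -O https://example.com/big-archive.tar.gz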


For downloading files from a directory listing, wget (rather than curl) is the usual tool: use -r (recursive), -np (don't follow links to parent directories), and -k to make links in downloaded HTML or CSS point to local files (credit @xaccrocheur).
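
A typical invocation might look like this (the URL is a placeholder):

# -r recurse, -np never ascend to the parent directory, -k rewrite links in the
# saved HTML/CSS so they point at the local copies
wget -r -np -k https://example.com/files/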


curl can only fetch single web pages; the bunch of lines you got is actually the directory index (which you would also see in your browser if you went to that URL). To grab the individual files with curl and a bit of Unix-tool magic, you could use something like the sketch below.
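
A rough sketch, assuming a simple Apache-style index whose links point directly at the files; the base URL and the link-extraction pattern will need adjusting for the real listing.

base=https://example.com/files/

# List the index page, pull out the href targets, drop the parent-directory
# link, and fetch each remaining entry
curl -s "$base" |
  grep -o 'href="[^"]*"' |
  sed 's/^href="//; s/"$//' |
  grep -v '^\.\.' |
  while read -r name; do
    curl -s -O "$base$name"
  done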


curl supports both HTTP and SOCKS proxy servers, with optional authentication. It does not have special support for FTP proxy servers since there are no standards for those, but it can still be made to work with many of them. You can also use both HTTP and SOCKS proxies to transfer files to and from FTP servers.
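
For example (proxy host names, ports, and credentials are placeholders):

# HTTP proxy with authentication
curl -x http://proxy.example.com:8080 -U alice:secret -O https://example.com/file.zip

# SOCKS5 proxy, letting the proxy resolve host names
curl -x socks5h://localhost:1080 -O https://example.com/file.zip

# The same proxy options also work when the target is an FTP server
curl -x http://proxy.example.com:8080 -O ftp://ftp.example.com/pub/file.zip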


Different protocols provide different ways of getting detailed information about specific files or documents. To get curl to show detailed information about a single file, use the -I/--head option. It displays all available information about a single file for HTTP and FTP; the HTTP information is considerably more extensive.
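
For example:

# HTTP: prints the response headers (Content-Length, Last-Modified, and so on)
curl -I https://example.com/file.zip

# FTP: prints what curl can learn about the file, such as its size and date
curl -I ftp://ftp.example.com/pub/file.zip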


If the content-type is not specified, curl will try to guess from the file extension (it only knows a few), or use the previously specified type (from an earlier file if several files are specified in a list) or else it will use the default type application/octet-stream.
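
This appears to refer to multipart form uploads with -F, where the type can also be stated explicitly; the URL and field names below are placeholders.

# Let curl guess the type from the extension, or state it explicitly after a semicolon
curl -F "upload=@report.pdf" https://example.com/submit
curl -F "upload=@data.bin;type=application/octet-stream" https://example.com/submit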


curl is also capable of using client certificates to get/post files from sites that require valid certificates. The only drawback is that the certificate needs to be in PEM-format. PEM is a standard and open format to store certificates with, but it is not used by the most commonly used browsers. If you want curl to use the certificates you use with your favorite browser, you may need to download/compile a converter that can convert your browser's formatted certificates to PEM formatted ones.
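
A hedged sketch; the file names are placeholders, and the exact openssl invocation depends on how the browser exported the certificate.

# Convert a PKCS#12 export from a browser into PEM (certificate and key in one file)
openssl pkcs12 -in browser-export.p12 -out client.pem -nodes

# Present the client certificate when fetching the page
curl --cert client.pem https://secure.example.com/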


Unix introduced the .netrc concept a long time ago. It is a way for a user to store the name and password for commonly visited FTP sites in a file so that you do not have to type them in each time you visit those sites. This is obviously a big security risk if someone else gets hold of your passwords, so most Unix programs will not read this file unless it is readable only by you (curl, however, does not care).
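
A minimal sketch; the host and credentials are placeholders, and the file should be readable only by you.

# Contents of ~/.netrc (keep the file mode 600):
#   machine ftp.example.com
#   login myuser
#   password mysecret

# -n tells curl to read credentials for the host from ~/.netrc
curl -n -O ftp://ftp.example.com/pub/backup.tar.gz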


As is mentioned above, you can download multiple files with one command line by simply adding more URLs. If you want those to get saved to a local file instead of just printed to stdout, you need to add one save option for each URL you specify. Note that this also goes for the -O option (but not --remote-name-all).
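
For example (the URLs are placeholders):

# One -O per URL...
curl -O https://example.com/a.txt -O https://example.com/b.txt

# ...or let --remote-name-all apply -O to every URL that follows
curl --remote-name-all https://example.com/a.txt https://example.com/b.txt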


If you are working in a hybrid IT environment, you often need to download or upload files from or to the cloud in your PowerShell scripts. If you only use Windows servers that communicate through the Server Message Block (SMB) protocol, you can simply use the Copy-Item cmdlet to copy the file from a network share:
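
# A minimal sketch; the share path and file names are placeholders
Copy-Item -Path '\\fileserver\backups\archive.zip' -Destination 'C:\Temp\archive.zip'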


In the example, we just download the HTML page that the web server at www.contoso.com generates. Note that, if you only specify the folder without the file name, as you can do with Copy-Item, PowerShell will error:
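
# Works: the target file name is given explicitly (local paths are placeholders)
Invoke-WebRequest -Uri 'http://www.contoso.com' -OutFile 'C:\Temp\contoso.html'

# Errors as described above: only a folder is given, which Copy-Item would accept
Invoke-WebRequest -Uri 'http://www.contoso.com' -OutFile 'C:\Temp\'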


If you have a web server where directory browsing is allowed, I guess you could use Invoke-WebRequest/Invoke-RestMethod against that folder, which would list the available files. Then you could parse the output and ask for specific files to be downloaded (or all of them). But I don't see any straightforward way.


This works fine, but I cannot step through the content. When I put it through a foreach loop, it dumps every line at once. If I save it to a file, I can use [System.IO.File]::ReadLines to step through it line by line, but that only works if I download the file. How can I accomplish this without downloading the file?


I am trying to download files from a site; sadly, they are generated with the Epoch Unix timestamp included in the file name, for example: Upload_Result_20210624_1624549986563.txt and system_Result_20210624_1624549986720.csv.


A typical scenario for downloading files on a regular basis could be if you want to save a web server's backups locally. In this case, it is convenient if the whole process is performed without user intervention. Then, the FTP download could run automatically as a scheduled task.


If you are not in the correct remote directory, you can change to it with cd. The download itself is then done with get, or with mget if you want to download multiple files. The latter supports wildcards, so the script file would contain an mget command with the desired pattern (see the sketch below).
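
A rough sketch of such an unattended session, assuming a classic command-line ftp client; the host, credentials, directory, and file pattern are all placeholders.

# -n suppresses auto-login, -i turns off per-file prompting for mget
ftp -in ftp.example.com <<'EOF'
user backupuser secretpassword
binary
cd /backups
mget *.tar.gz
bye
EOF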


curl lets you quickly download files from a remote system. curl supports many different protocols and can also make more complex web requests, including interacting with remote APIs to send and receive data.
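
The simplest case looks like this (the URL is a placeholder):

# Download the file and keep its remote name
curl -O https://example.com/release/tool-1.2.3.tar.gz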


If true and dest is not a directory, will download the file every time and replace the file if the contents change. If false, the file will only be downloaded if the destination does not exist. Generally should be true only for small local files.


By default this module uses atomic operations to prevent data corruption or inconsistent reads from the target filesystem objects, but sometimes systems are configured or just broken in ways that prevent this. One example is docker mounted filesystem objects, which cannot be updated atomically from inside the container and can only be written in an unsafe manner.


Searches and reports performed on this RCSB PDB website utilize data from the PDB archive. The PDB archive is maintained by the wwPDB at the main archive, files.wwpdb.org (data download details) and the versioned archive, files-versioned.wwpdb.org (versioning details).


All data are available via HTTPS and FTP. Note that FTP users should switch to binary mode before downloading data files. Note also that most web browsers (e.g., Chrome) have dropped support for FTP. You will need a separate FTP client for downloading via FTP protocol.


PDB entry files are available in several file formats (PDB, PDBx/mmCIF, XML, BinaryCIF), compressed or uncompressed, and with an option to download a file containing only "header" information (summary data, no coordinates).
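
For example, with curl over HTTPS; the entry ID and download URL pattern below are assumptions, so check the site's download documentation for the current paths.

# Compressed legacy PDB format and uncompressed PDBx/mmCIF for an assumed entry ID
curl -O https://files.rcsb.org/download/4HHB.pdb.gz
curl -O https://files.rcsb.org/download/4HHB.cif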


The JFrog CLI offers enormous flexibility in how you download, upload, copy, or move files through use of wildcard or regular expressions with placeholders.


By default, the command only downloads files which are cached on the current Artifactory instance. It does not download files located on remote Artifactory instances, through remote or virtual repositories. To allow the command to download files from remote Artifactory instances, which are proxied by the use of remote repositories, set the JFROG_CLI_TRANSITIVE_DOWNLOAD_EXPERIMENTAL environment variable to true. This functionality requires version 7.17 or above of Artifactory.
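
For example (a sketch; the repository path is a placeholder):

# Allow downloads to be fetched through remote repositories that proxy another
# Artifactory instance (requires Artifactory 7.17 or above)
export JFROG_CLI_TRANSITIVE_DOWNLOAD_EXPERIMENTAL=true
jf rt download "my-remote-repo/org/acme/app/1.0/app-1.0.jar"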


The remote download functionality is supported only on remote repositories which proxy repositories on remote Artifactory instances. Downloading through a remote repository which proxies non-Artifactory repositories is not supported.


If placeholders are used and you would like the local file system (download path) to be determined by the placeholders only, in other words, to avoid concatenating the Artifactory folder hierarchy locally, set this option to false.


The minimum size permitted for splitting. Files larger than the specified number will be split into equally sized --split-count segments. Any files smaller than the specified number will be downloaded in a single thread. If set to -1, files are not split.


If the target path ends with a slash, the path is assumed to be a directory. For example, if you specify the target as "repo-name/a/b/", then "b" is assumed to be a directory into which files should be downloaded. If there is no terminal slash, the target path is assumed to be a file to which the downloaded file should be renamed. For example, if you specify the target as "a/b", the downloaded file is renamed to "b".
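
For example (repository and local paths are placeholders):

# Trailing slash: "b" is treated as a directory to download into
jf rt download "repo-name/path/file.zip" "a/b/"

# No trailing slash: the downloaded file is renamed to "b"
jf rt download "repo-name/path/file.zip" "a/b"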


The download command, as well as other commands which download dependencies from Artifactory, accepts the --build-name and --build-number command options. Adding these options records the downloaded files as build dependencies. In some cases, however, it is necessary to add a file which has been downloaded by another tool to a build. Use the build-add-dependencies command to do this.
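
A sketch of both approaches; the build name, build number, and paths are placeholders.

# Record the downloaded files as dependencies of a build
jf rt download --build-name=myBuild --build-number=42 "libs-release/com/acme/"

# Record a file that was fetched by some other tool as a dependency of the same build
jf rt build-add-dependencies myBuild 42 "downloads/other-tool-output.zip"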

