Downloading Data with curl
curl (Client for URLs) transfers data to and from servers. It’s commonly used to download files from HTTP sites and FTP servers, although it supports many other protocols.
Check whether curl is installed:
man curl
- If you see curl command not found, install it first.
- When it is installed, the manual page displays sections such as Name, Synopsis, and Description.
- Press Enter to move through the manual and q to quit.
Basic syntax:
curl [option flags] [URL]
curl supports HTTP, HTTPS, FTP, SFTP, and more.
# Save with the original filename
curl -O https://websitename.com/datafilename.txt
# Save with a different name
curl -o renamedatafilename.txt https://websitename.com/datafilename.txt
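Download multiple files with URL globbing. curl expands numeric ranges in square brackets itself; it does not expand a shell-style * wildcard in a URL, so give the range explicitly and quote the URL so the shell leaves the brackets alone:
# Files 001 through 100
curl -O "https://websitename.com/datafilename[001-100].txt"
# Every 10th file: 001, 011, 021, ..., 091
curl -O "https://websitename.com/datafilename[001-100:10].txt"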
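To see which URLs a stepped range such as [001-100:10] stands for, the expansion can be imitated with a short loop. This is only an illustrative sketch: expand_range is a hypothetical helper, not part of curl, and it does not reflect how curl implements globbing internally.

```shell
# expand_range START END STEP PREFIX SUFFIX
# Hypothetical helper that prints one URL per value in the range.
expand_range() {
  i=$1
  while [ "$i" -le "$2" ]; do
    # %03d zero-pads to three digits, matching the width of "001"
    printf '%s%03d%s\n' "$4" "$i" "$5"
    i=$((i + $3))
  done
}

# Same range as [001-100:10], using the placeholder host from the examples
urls=$(expand_range 1 100 10 "https://websitename.com/datafilename" ".txt")
echo "$urls"
```

The loop prints ten URLs, from datafilename001.txt to datafilename091.txt, because the step is counted from the start of the range.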
Useful flags:
- -L: follow redirects (HTTP 3xx).
- -C: resume a transfer that was interrupted. Passing - as the offset (-C -) lets curl work out where to resume.
curl -L -O -C - https://websitename.com/datafilename[001-100].txt
Downloading Data with wget
wget (“web get”) is a versatile downloader that can fetch single files, entire folders, or whole web pages. It supports recursive downloads.
Locate wget:
which wget
# returns nothing if it's not installed
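The same check can be scripted for both downloaders. The sketch below uses the shell's command -v, which succeeds only when a name resolves to a command; have_tool is a hypothetical helper name chosen for illustration.

```shell
# have_tool NAME: succeeds if NAME resolves to a command (hypothetical helper)
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report on each downloader used in these notes
for tool in curl wget; do
  if have_tool "$tool"; then
    echo "$tool: found at $(command -v "$tool")"
  else
    echo "$tool: not installed"
  fi
done
```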
Install:
# Linux (Debian/Ubuntu)
sudo apt-get install wget
# macOS
brew install wget
# Windows: install it via the GnuWin32 package.
Basic usage:
wget [option flags] [URL]
Common flags:
- -b: run the download in the background.
- -q: quiet mode (suppress output).
- -c: resume a partial download.
wget -bqc https://websitename.com/datafilename.txt
Advanced Downloading with wget
Download every URL listed in a file:
cat url_list.txt
wget -i url_list.txt
Limit the transfer rate (--limit-rate is bytes per second, so append k for kilobytes):
wget --limit-rate=200k -i url_list.txt
Add delays between requests (--wait takes a number of seconds):
wget --wait=2.5 -i url_list.txt
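Combining both options makes a batch slower in a predictable way, which a little arithmetic can estimate. The sketch below is a rough lower bound only. The file count and file size are made-up inputs, and it assumes that k means 1024 bytes and that wget pauses once between each pair of consecutive downloads.

```shell
# Hypothetical inputs: 3 files of 50 MB each
file_count=3
file_size_mb=50
rate_kb_per_s=200   # matches --limit-rate=200k
wait_s=2.5          # matches --wait=2.5

# awk handles the fractional wait; shell arithmetic is integer-only
estimate=$(awk -v n="$file_count" -v mb="$file_size_mb" \
               -v rate="$rate_kb_per_s" -v w="$wait_s" 'BEGIN {
  transfer = n * mb * 1024 / rate   # seconds spent transferring at the capped rate
  pauses   = (n - 1) * w            # one pause between consecutive downloads
  printf "at least %.1f seconds", transfer + pauses
}')
echo "$estimate"   # prints: at least 773.0 seconds
```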
Getting Started with csvkit
Install the toolkit:
pip install csvkit
View help:
in2csv --help
in2csv -h
Convert Excel to CSV:
# Convert the first sheet and save it as CSV
in2csv SpotifyData.xlsx > SpotifyData.csv
# Print the first sheet to stdout without saving
in2csv SpotifyData.xlsx
# List sheet names
in2csv -n SpotifyData.xlsx
# Convert a specific sheet by name
in2csv SpotifyData.xlsx --sheet "Worksheet1_Popularity" > Spotify_Popularity.csv
csvlook pretty-prints CSV data:
csvlook -h
csvlook Spotify_Popularity.csv
Summaries with csvstat:
csvstat Spotify_Popularity.csv
Filtering Data with csvkit
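csvcut filters columns:
csvcut -h
# List column names with their positions
csvcut -n Spotify_MusicAttributes.csv
# Select one column by position or by name
csvcut -c 1 Spotify_MusicAttributes.csv
csvcut -c "track_id" Spotify_MusicAttributes.csv
# Select several columns by position or by name
csvcut -c 2,3 Spotify_MusicAttributes.csv
csvcut -c "danceability","duration_ms" Spotify_MusicAttributes.csv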
csvgrep filters rows:
csvgrep -h
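# -m keeps rows whose value in the chosen column contains the given string; pick the column by name or by position
csvgrep -c "track_id" -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_MusicAttributes.csv
csvgrep -c 1 -m 5RCPsfzmEpTXMCTNk7wEfQ Spotify_MusicAttributes.csv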
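To try these filters without the Spotify files, a tiny CSV is enough. In the sketch below only the column names come from the examples above. The file name sample_attributes.csv and every value in it are made up for illustration, and the csvkit commands run only if csvkit is installed.

```shell
# Made-up sample rows; only the column names come from the examples above
cat > sample_attributes.csv <<'EOF'
track_id,danceability,duration_ms
id_001,0.50,200000
id_002,0.75,180000
EOF

if command -v csvcut >/dev/null 2>&1; then
  # Keep two of the three columns
  csvcut -c "track_id","danceability" sample_attributes.csv
  # Keep only rows whose track_id contains id_002
  csvgrep -c "track_id" -m id_002 sample_attributes.csv
else
  echo "csvkit is not installed; run: pip install csvkit"
fi
```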
Stacking Data and Chaining Commands with csvkit
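csvstack stacks rows from multiple CSV files:
csvstack -h
# Stack files with identical columns
csvstack Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
# Add a "group" column that records which file each row came from
csvstack -g "Rank6","Rank7" Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv
# Name that column "source" instead of the default "group"
csvstack -g "Rank6","Rank7" -n "source" Spotify_Rank6.csv Spotify_Rank7.csv > Spotify_AllRanks.csv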
# Commands separated with ; run sequentially
csvlook SpotifyData_All.csv; csvstat SpotifyData_All.csv
# Commands joined with && execute the second only if the first succeeds
csvlook SpotifyData_All.csv && csvstat SpotifyData_All.csv
# A pipe (|) sends the output of the first command to the second as its input
csvcut -c "track_id","danceability" Spotify_Popularity.csv | csvlook
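The difference between these operators is easiest to see when the first command fails. The following sketch swaps the csvkit commands for the standard false, echo, printf, sort, and head commands so it runs anywhere; the variable names are arbitrary.

```shell
# Keep going after a failing command so both forms can be compared
set +e

# ';' runs the second command regardless of the first command's exit status
semicolon_out=$(false; echo "ran after ;")

# '&&' skips the second command because 'false' exits with a non-zero status
and_out=$(false && echo "ran after &&")

# '|' feeds the output of each command into the next one
pipe_out=$(printf 'b\na\n' | sort | head -n 1)

echo "semicolon: [$semicolon_out]"
echo "and:       [$and_out]"
echo "pipe:      [$pipe_out]"
```

The second pair of brackets prints empty, since echo never runs after the failed command.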
Pulling Data from Databases
sql2csv can export results from popular SQL databases.
sql2csv -h
sql2csv --db "sqlite:///SpotifyDatabase.db" \
--query "SELECT * FROM Spotify_Popularity" \
> Spotify_Popularity.csv
Manipulating Data with SQL Syntax
# Query a CSV with SQL; the table name is the filename without its extension
csvsql --query "SELECT * FROM Spotify_MusicAttributes LIMIT 1" Spotify_MusicAttributes.csv
# Pretty-print the result
csvsql --query "SELECT * FROM Spotify_MusicAttributes LIMIT 1" \
data/Spotify_MusicAttributes.csv | csvlook
# Save the result to a new file
csvsql --query "SELECT * FROM Spotify_MusicAttributes LIMIT 1" \
data/Spotify_MusicAttributes.csv > OneSongFile.csv
# Query across several files; each CSV becomes its own table
csvsql --query "SELECT * FROM file_a INNER JOIN file_b ..." file_a.csv file_b.csv
Pushing Data Back to a Database
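# Create a table from the CSV and insert its rows
csvsql --db "sqlite:///SpotifyDatabase.db" \
--insert Spotify_MusicAttributes.csv
# Skip type inference and create the table without length limits or null checks
csvsql --no-inference --no-constraints \
--db "sqlite:///SpotifyDatabase.db" \
--insert Spotify_MusicAttributes.csv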
Python on the Command Line
man python
python --version # check the Python version
which python # find the interpreter path
# Start an interactive session
python
>>> print("hello world")
>>> exit()
# Write a one-line script, then run it
echo "print('hello world')" > hello_world.py
python hello_world.py
Installing Python Packages with pip
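# Upgrade pip itself
pip install --upgrade pip
pip list # show installed packages
# Install the latest version, or pin a specific one
pip install scikit-learn
pip install scikit-learn==0.19.2
# Upgrade an installed package
pip install --upgrade scikit-learn
# Install or upgrade several packages at once
pip install scikit-learn statsmodels
pip install --upgrade scikit-learn statsmodels
# Install every package listed in a requirements file
cat requirements.txt
pip install -r requirements.txt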
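A requirements file is plain text with one package per line, optionally pinned with ==. The sketch below writes a hypothetical one that reuses the package names from this section. It overwrites any requirements.txt in the current directory, so try it in a scratch folder.

```shell
# Hypothetical requirements.txt: one package per line, optionally pinned with ==
cat > requirements.txt <<'EOF'
scikit-learn==0.19.2
statsmodels
EOF

# Show what pip would be asked to install
cat requirements.txt
# To install everything listed: pip install -r requirements.txt
```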
Automating Data Jobs with cron
crontab -l # list scheduled jobs
man crontab
# Schedule create_model.py to run every minute; piping into crontab replaces any existing crontab
echo "* * * * * python create_model.py" | crontab
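The five leading fields of a crontab entry are, in order, minute, hour, day of month, month, and day of week; an asterisk means "every". As a sketch, the entry can be split into labeled parts with the shell's read. The variable names are chosen here for illustration and nothing is installed into cron.

```shell
# The entry from above: five schedule fields, then the command to run
entry='* * * * * python create_model.py'

# read splits on whitespace; the last variable receives the rest of the line
read -r minute hour day_of_month month day_of_week job_command <<EOF
$entry
EOF

printf 'minute=%s hour=%s day_of_month=%s month=%s day_of_week=%s\n' \
  "$minute" "$hour" "$day_of_month" "$month" "$day_of_week"
printf 'command=%s\n' "$job_command"
```

With every field set to an asterisk, the job runs once a minute.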