Download and read non-UTF-8 files through FTP with Ruby
Some FTP servers are set up with Windows software like FileZilla. It is common to store CSV or text files on such servers to support import workflows. Because these files often originate on Windows, they may be saved in a legacy single-byte encoding rather than UTF-8, and you can run into encoding issues when reading them in a Linux or Unix environment. I am no encoding expert, but Ruby, like Python and many other languages, treats text as UTF-8 by default on most Linux and macOS setups, so UTF-8 is what you want to end up with.
I recently experienced this problem. I was trying to download a non-UTF-8 file from an FTP server and read it from a Ruby on Rails application. As it was not straightforward, I wrote this post to show how I solved it.
Logging into the FTP server with authentication
require 'net/ftp'
#Auth example:
server = 'lean.bean.server.com'
username = 'leanbean'
password = 'mexican-party-secret'
#Set up the connection
ftp = Net::FTP.new(server,username,password)
Tip: Once the connection is established, you can navigate the folder structure by calling methods such as pwd, ls, and chdir on your connection object. For example:
ftp.chdir 'ImportantFolder'
ftp.pwd
Download the txt file, save it as a .csv file in a destination folder, and close the connection:
#Make sure to use the binary download option 'getbinaryfile'
ftp.getbinaryfile 'guestlist.txt','destination/guestlist.csv', 1024
ftp.close
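The connect, download, and close steps above can also be wrapped so that the connection is closed even when the download fails. This is only a sketch: the helper name download_binary and its parameters are made up for illustration, and it relies on the block form of Net::FTP.open from Ruby's net/ftp library.

```ruby
require 'net/ftp'

# Hypothetical helper (not from the original post): downloads one remote
# file in binary mode. The block form of Net::FTP.open closes the
# connection when the block exits, including when an error is raised.
def download_binary(server, username, password, remote_path, local_path)
  Net::FTP.open(server, username, password) do |ftp|
    # 1024 is the block size in bytes, as in the example above
    ftp.getbinaryfile(remote_path, local_path, 1024)
  end
  local_path
end

# Usage (needs a reachable server, so it is left commented out):
# download_binary('lean.bean.server.com', 'leanbean', 'mexican-party-secret',
#                 'guestlist.txt', 'destination/guestlist.csv')
```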
Read with Ruby: “invalid byte sequence in UTF-8 (ArgumentError)”
require 'smarter_csv'
file = "destination/guestlist.csv"
options = {encoding: "utf-8", headers_in_file: true, col_sep: "\t"}
guests = SmarterCSV.process(file,options)
puts guests
#=> /Users/partyhost/.rvm/gems/ruby-2.2.0/gems/smarter_csv-1.2.3/lib/smarter_csv/smarter_csv.rb:133:in `=~': invalid byte sequence in UTF-8 (ArgumentError)
#   from /Users/partyhost/.rvm/gems/ruby-2.2.0/gems/smarter_csv-1.2.3/lib/smarter_csv/smarter_csv.rb:133:in `process'
#   from script/myscript.rb:9:in `<top (required)>'
Bummer! Let’s check the file encoding:
#OSX method (on Linux the flag is lowercase: file -i)
file -I guestlist.csv
#=> text/plain; charset=unknown-8bit
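This is our problem: 'unknown-8bit' means the file contains bytes above 127 that are not valid UTF-8, and the file command cannot tell which encoding they belong to.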
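The same check can be done from Ruby before handing the file to a CSV parser. The sketch below reads the raw bytes and asks whether they form valid UTF-8; the helper name valid_utf8_file? is hypothetical.

```ruby
# Hypothetical helper (not from the original post): returns true when the
# raw bytes of the file form a valid UTF-8 sequence, false otherwise.
def valid_utf8_file?(path)
  # binread returns the bytes untouched; force_encoding only relabels them
  File.binread(path).force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage (assumes the file downloaded earlier exists):
# valid_utf8_file?("destination/guestlist.csv")
```

A false result points to the same problem as the 'unknown-8bit' report.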
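Make the file readable as UTF-8 by copying it with ‘cat -v’
cat -v guestlist.csv > fixed_guestlist.csv
file -I fixed_guestlist.csv
#=> text/plain; charset=us-ascii
This fixed the encoding error. cat -v rewrites every non-printing byte in visible ASCII notation, so the copy is plain us-ascii, which is also valid UTF-8. Note that this changes the content: non-ASCII characters such as accented letters are replaced by M- sequences rather than converted. We’re also not 100% ready to read our file yet. cat -v turned the carriage return in each Windows line ending (CRLF) into the two literal characters “^M” at the end of each line, and we need to get rid of those.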
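Because cat -v rewrites non-ASCII bytes instead of converting them, an alternative is to transcode the file in Ruby. The sketch below assumes the file is Windows-1252, a common encoding for files produced on Windows but not confirmed for this file. The helper name transcode_to_utf8 is hypothetical. It also converts CRLF line endings to LF in the same pass.

```ruby
# Hypothetical helper (not from the original post): re-encodes a file to UTF-8.
# ASSUMPTION: the source is Windows-1252; pass another encoding name if
# your file differs.
def transcode_to_utf8(source_path, target_path, source_encoding = "Windows-1252")
  raw = File.binread(source_path)             # read the bytes untouched
  text = raw.force_encoding(source_encoding)  # label the bytes with their assumed encoding
  utf8 = text.encode("UTF-8",
                     invalid: :replace,        # replace bad bytes instead of raising
                     undef: :replace,          # same for bytes with no UTF-8 mapping
                     universal_newline: true)  # CRLF and CR become LF
  File.binwrite(target_path, utf8)
  target_path
end

# Usage (assumes the file downloaded earlier exists):
# transcode_to_utf8("destination/guestlist.csv", "destination/guestlist_utf8.csv")
```

With this approach accented characters survive the conversion, and no ^M cleanup is needed afterwards.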
Removing funky characters from end of line, ^M
filepath = "fixed_guestlist.csv"
text = File.read(filepath)
#cat -v left a literal "^M" at the end of each line; remove it and keep the newline
replace = text.gsub(/\^M$/, "")
File.open(filepath, "w") {|file| file.write replace}
Bingo! We should now have a cleanly formatted file that reads as valid UTF-8 and is ready for further work.