This time I needed some geographical data evaluation for one of my current projects. The data used to be opensource and as possible qualitative and complete. After spending some time in front of the famous search engine window there was a solution:
http://www.geonames.org/
That's the first sentence from the site:
The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge.
Geonames is used by number of sites. Besides data it has also web services and code libraries in many programming languages to access them.
Direct web services are such a great idea, but an objection I've met for them was a question: "what happens when the service is unavaliable?". So I did just weight up pros and cons which could affect my current project.
Using Geonames web services
PROS
- no stress for the local system to perform db requests
- no need for data synchronization
CONS
- if the service gets unavailable or , it could make the rest of the app unusable
- we've got only data delivered by service
Using Geonames imported data
PROS
- possibility of the data availability is definitely high
- some data could be corrected locally or missing data could be added locally (I don't say that's nice)
CONS
- data amount is relatively high (about 1gb+) on average ext3 partition
- additional effort for data and deltas import
That's it, at the end of the day I've decided for much application stability and for local data assess all the more we have enough space on the server.
First lets start with database design, because the data we will get from the Geonames is a simple CSV format. Some approaches for MySQL can be found on geonames forums but in my case it has to be Postgres. Therefore doing it from scratch took a while to define resonable field definitions.
SQL:
DROP TABLE IF EXISTS iso_languagecodes CASCADE;
DROP TABLE IF EXISTS geoname CASCADE;
DROP TABLE IF EXISTS countryinfo CASCADE;
DROP TABLE IF EXISTS geoname_geonameid_key;
DROP TABLE IF EXISTS iso_languagecodes_iso_639_1_key;
DROP TABLE IF EXISTS countryinfo_key;
CREATE TABLE geoname (
geonameid INTEGER UNIQUE,
name VARCHAR(200),
asciiname VARCHAR(200),
alternatenames VARCHAR(5000),
latitude DECIMAL NOT NULL,
longtitude DECIMAL NOT NULL,
feature_class CHAR(1),
feature_code VARCHAR(10),
country_code CHAR(2),
cc2 VARCHAR(60),
admin1_code VARCHAR(20),
admin2_code VARCHAR(80),
admin3_code VARCHAR(20),
admin4_code VARCHAR(20),
population BIGINT,
elevation INTEGER,
gtopo30 INTEGER,
timezone VARCHAR(100),
modification_date DATE
);
CREATE TABLE iso_languagecodes (
iso_639_3 CHAR(3),
iso_639_2 VARCHAR(10),
iso_639_1 CHAR(2) UNIQUE,
language_name VARCHAR(200)
);
CREATE TABLE countryinfo (
iso CHAR(2),
iso3 CHAR(3),
iso_numeric INTEGER,
fips VARCHAR(3),
name VARCHAR(256),
capital VARCHAR(256),
area NUMERIC(32,8),
population INTEGER,
continent CHAR(2),
tld CHAR(8),
currencyCode CHAR(3),
currencyName CHAR(20),
Phone CHAR(32),
postalCodeFormat CHAR(128),
postalCodeRegex CHAR(256),
languages VARCHAR(256),
geonameId INTEGER UNIQUE,
neighbours CHAR(256),
equivalentFipsCode CHAR(16)
);
Now is time for the script which will do clean import of the current Geonames data. I would only remark here, that the way psql commands are working in this script needs local assess from trusted user to be configured in Postgres.
BASH:
#!/bin/sh
# base url we get files from
BASE_URL='http://download.geonames.org/export/dump/'
# define files we wanna to get
FILE_LIST="allCountries.zip readme.txt countryInfo.txt iso-languagecodes.txt"
# pg database definitions
DB_NAME=siteapp
DB_CMD="psql -h localhost -d $DB_NAME"
DB_IMPORT_STATEMENT="COPY %s FROM STDIN WITH DELIMITER AS E'\t' NULL AS ''"
# getting current date
TODAY_DATE=`date +%F`
#create download dir based on the current date
mkdir -p "$TODAY_DATE" && cd "$TODAY_DATE"
# download files
for FILE in $FILE_LIST
do
wget -c "$BASE_URL/$FILE"
done
# drop and create the table structure
$DB_CMD < ../pg_struct.sql
# unpack and import iso language codes into db
DATAFILE=iso-languagecodes.txt
LINE_COUNT=`cat $DATAFILE | wc -l`; let "LINE_COUNT=$LINE_COUNT-1"
tail -$LINE_COUNT $DATAFILE | tr -d '\r' | $DB_CMD \
-c"`printf \"$DB_IMPORT_STATEMENT\" iso_languagecodes`"
# unpack and import all data into db
DATAFILE=allCountries.txt
rm -fr $DATAFILE
unzip allCountries.zip $DATAFILE
cat $DATAFILE | tr -d '\r' | $DB_CMD -c"`printf \"$DB_IMPORT_STATEMENT\" geoname`"
rm $DATAFILE
# unpack and import country info into db
DATAFILE=countryInfo.txt
tail -248 $DATAFILE | tr -d '\r' | $DB_CMD -c"`printf \"$DB_IMPORT_STATEMENT\" countryinfo`"
cd ..
exit 0
The next script is needed to perform daily updates because Geonames issue updates daily. In fact this script should be run as a cronjob.
BASH:
#!/bin/sh
# make yesterday date
YESTERDAY=`date --date='1 day ago' +%F`
# define delta filenames
MODS_FILE="modifications-$YESTERDAY.txt"
DELS_FILE="deletes-$YESTERDAY.txt"
FILE_LIST="$MODS_FILE $DELS_FILE"
# the base url we get the data files from
BASE_URL='http://download.geonames.org/export/dump/'
# database options
DB_NAME=siteapp
DB_CMD="psql -h localhost -d $DB_NAME"
DB_IMPORT_STATEMENT="COPY %s FROM STDIN WITH DELIMITER AS E'\t' NULL AS ''"
# generate today date
TODAY_DATE=`date +%F`
# make dir for data download
mkdir -p "$TODAY_DATE" && cd "$TODAY_DATE"
# download required files
for FILE in $FILE_LIST
do
wget -c "$BASE_URL/$FILE"
done
# first delete deletes
cat $DELS_FILE | cut -f1 | while read ID
do
$DB_CMD -c"DELETE FROM geoname WHERE geonameid = $ID"
done
# first delete the changed records
cat $MODS_FILE | cut -f1 | while read ID
do
$DB_CMD -c"DELETE FROM geoname WHERE geonameid = $ID"
done
# then import all the changes into db
cat $MODS_FILE | tr -d '\r' | $DB_CMD -c"`printf \"$DB_IMPORT_STATEMENT\" geoname`"
cd ..
exit 0
Note that I'm not importing all the data files provided but the full database. If you need more data such timezones or alternative names available from Geonames just extend SQL and bash scripts.
Now enjoy and start using Geonames data in your project
Kind of Regards