This article was originally published by TeamSQL. Thank you for supporting the partners who make SitePoint possible.
Importing a large amount of data into Redshift is easy using the COPY command. To demonstrate this, we’ll import the publicly available dataset “Twitter Data for Sentiment Analysis” (see Sentiment140 for additional information).
Note: You can connect to AWS Redshift with TeamSQL, a multi-platform DB client that works with Redshift, PostgreSQL, MySQL & Microsoft SQL Server and runs on Mac, Linux and Windows. You can download TeamSQL for free.
Download the ZIP file containing the training data here.
The Redshift Cluster
For the purposes of this example, the Redshift Cluster’s configuration specifications are as follows:
- Cluster Type: Single Node
- Node Type: dc1.large
- Zone: us-east-1a
Create a Database in Redshift
Run the following command to create a new database in your cluster:
CREATE DATABASE sentiment;
Create a Schema in the Sentiment Database
Run the following command to create a scheme within your newly-created database:
CREATE SCHEMA tweets;
The Schema (Structure) of the Training Data
The CSV file contains the Twitter data with all emoticons removed. There are six columns:
- The polarity of the tweet (key: 0 = negative, 2 = neutral, 4 = positive)
- The id of the tweet (ex. 2087)
- The date of the tweet (ex. Sat May 16 23:58:44 UTC 2009)
- The query (ex. lyx). If there is no query, then this value is NO_QUERY.
- The user that tweeted (ex. robotickilldozr)
- The text of the tweet (ex. Lyx is cool)
Create a Table for Training Data
Begin by creating a table in your database to hold the training data. You can use the following command:
CREATE TABLE tweets.training (
polarity int,
id BIGINT,
date_of_tweet varchar,
query varchar,
user_id varchar,
tweet varchar(max)
)
Uploading CSV File to S3
To use Redshift’s COPY command, you must upload your data source (if it’s a file) to S3.
To upload the CSV file to S3:
- Unzip the file you downloaded. You’ll see 2 CSV files: one is test data (used to show structure of original dataset), and the other (file name: training.1600000.processed.noemoticon) contains the original data. We will upload and use the latter file.
- Compress the file. If you’re using macOS or Linux, you can compress the file using GZIP by running the following command in Terminal:
gzip training.1600000.processed.noemoticon.csv
- Upload your file using the AWS S3 Dashboard.
Alternatively, you can use Terminal/Command Line to upload your file. To do this, you must install AWS CLI and, after installation, configure it (run aws configure
in your terminal to start the configuration wizard) with your access and secret key.
Connect TeamSQL to the Redshift Cluster and Create the Schema
Open TeamSQL (if you don’t have the TeamSQL Client, download it from teamsql.io) and add a new connection.
- Click Create a Connection to launch the Add Connection window.
- Select Redshift and provide the requested details to set up your new connection.
- Do not forget to enter the Default Database Name!
- Test the connection, and save if the test is successful.
- By default, TeamSQL displays the connections you’ve added in the left-hand navigation panel. To enable the connection, click on the socket icon.
- Right click on default database to open a new tab.
- Run this command to create a new schema in your database.
CREATE SCHEMA tweets;
- Refresh the database list in the left-hand navigation panel with right clicking on connection item.
- Create a new table for training data.
CREATE TABLE tweets.training (
polarity int,
id int,
date_of_tweet varchar,
query varchar,
user_id varchar,
tweet varchar
)
- Refresh the connection and your table should appear in the left-hand list.
Using the COPY Command to Import Data
To copy your data from your source file to your data table, run the following command:
Continue reading %Import Data into Redshift Using the COPY Command%
by Can Abacıgil via SitePoint
No comments:
Post a Comment