Moving big data to the cloud has many advantages. The cloud enables you to run business intelligence on extra-large datasets and gain valuable insights. You can also make the data available for collaboration, or store it to ensure availability in case of failure or disaster.
Before taking advantage of these benefits, however, you need to be able to upload these large datasets in the first place. This article explains the challenges of uploading big data to the cloud and offers some possible solutions.
Challenges of Uploading Big Data to the Cloud
The main challenge of big data is its size. The amount of data generated every day is constantly growing. IDC predicts that worldwide data will reach 175 zettabytes by 2025. A zettabyte equals roughly 1,000 exabytes, and each exabyte in turn equals 1,000 petabytes. Worldwide data volume will continue growing because companies are constantly gathering data from users.
When you try to upload massive datasets, you can encounter issues such as long transfer times. Connection speed plays a key role in data transmission, and even high-speed connections are often not enough for large datasets. Uploading one terabyte of data at 800 Mbps would take almost three hours.
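The transfer-time estimate above is simple back-of-envelope arithmetic: convert the dataset size to bits and divide by the link speed. A minimal sketch (the function name is illustrative, not from any particular library):

```python
def transfer_time_hours(size_bytes: float, link_mbps: float) -> float:
    """Estimate upload time in hours for a dataset size and link speed."""
    bits = size_bytes * 8                     # bytes -> bits
    seconds = bits / (link_mbps * 1_000_000)  # Mbps = 10^6 bits per second
    return seconds / 3600

# 1 TB (10^12 bytes) over an 800 Mbps link:
print(f"{transfer_time_hours(1e12, 800):.2f} hours")  # -> 2.78 hours
```

Note that real transfers are slower than this ideal figure, since protocol overhead and network congestion reduce effective throughput.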
Big Data Uploading Solutions
Solving issues such as long transfer times is crucial. Major public cloud providers offer solutions such as dedicated connectivity for data ingestion. Other options may include compressing the files or physically shipping the disks to be copied to cloud storage.
Cloud platforms offer customers the option of purchasing dedicated connectivity from their own data center to one of the cloud vendor's locations. This enables cloud users to transfer data over a separate connection instead of the public Internet, usually at a higher speed. Dedicated connectivity through a cloud vendor is useful when the data is located in a single data center and the destination is a fixed cloud location. It is also a good fit for cold archives.
The downside of this method is that it can be expensive if you don't utilize it fully, because users pay for the provisioned bandwidth whether they use it or not. It also usually requires purchasing terminal equipment to connect the data center to the cloud vendor's Point of Presence (PoP). Dedicated connectivity can also cause vendor lock-in.
Multi-part upload tools
You can use a multi-part upload tool. This type of tool splits large files into multiple parts, sends the parts in parallel, and reassembles them at the destination. Using multi-part upload can improve your upload speed. However, this approach aggregates the throughput of separate TCP streams, and can still be subject to high latency.
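The mechanics of a multi-part upload can be sketched in a few lines: number the parts, send them concurrently, then reassemble them by index at the destination. This is a simplified illustration, not a real client; the `upload_part` stand-in is where a real tool would send each part to the storage service over its own connection.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes, for illustration; real tools use multi-megabyte parts

def split_into_parts(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a payload into numbered parts, as a multi-part upload tool does."""
    return [(i, data[off:off + chunk_size])
            for i, off in enumerate(range(0, len(data), chunk_size))]

def upload_part(part):
    """Stand-in for sending one part over its own connection."""
    index, payload = part
    return index, payload

def multipart_upload(data: bytes) -> bytes:
    parts = split_into_parts(data)
    with ThreadPoolExecutor(max_workers=4) as pool:
        received = list(pool.map(upload_part, parts))
    # Reassemble at the destination by part index, regardless of arrival order.
    received.sort(key=lambda p: p[0])
    return b"".join(payload for _, payload in received)

assert multipart_upload(b"large dataset payload") == b"large dataset payload"
```

Because each part is independent, a failed part can be retried on its own instead of restarting the whole upload, which is another practical advantage of this approach.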
Managed File Transfer
You can use a Managed File Transfer (MFT) service to transfer large files, instead of the standard File Transfer Protocol (FTP). MFT is a technology service that offers secure sharing of high-volume data over a network.
Traditional ad-hoc protocols, such as FTP, enable moving large volumes of data between any two entities. However, this type of communication protocol does not provide built-in security. Increased compliance requirements are driving companies to opt for a managed service to increase the security of data transfers. A managed service provides a higher level of control and visibility over the data, which can be useful when transferring sensitive data.
MFT transfers files over public and private networks by supporting multiple file transfer protocols, such as HTTP/S and FTP/S. It offers security features such as encryption of data in transit and at rest, as well as user authentication.
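One building block behind the control and visibility that managed transfers provide is end-to-end integrity verification: a digest is computed before the transfer and compared with one recomputed at the destination. A minimal sketch of that check, using SHA-256 from Python's standard library (the surrounding transfer logic is omitted):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Compute a SHA-256 digest, used to verify a transfer end to end."""
    return hashlib.sha256(data).hexdigest()

payload = b"sensitive records"
sent_digest = sha256_of(payload)      # computed by the sender before transfer
received_digest = sha256_of(payload)  # recomputed at the destination
assert sent_digest == received_digest  # a mismatch would indicate corruption
```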
You can accelerate file transfers by compressing the files before uploading them. If you have multiple large files, such as media files, you can use a file uploader. Many file uploaders include compression tools that enable you to compress multiple files at once.
Some file upload solutions provide compression, upload, and storage all in one. You can use a file uploader when you work with data from different sources. In addition, if you need to regularly upload images to a cloud repository, a file uploader can help you automate the process.
Some Internet service providers moderate upload bandwidth by using broadband traffic management. A Virtual Private Network (VPN) enables you to create a secure connection to another network over the Internet.
Using a VPN can help bypass your Internet Service Provider's (ISP) file size checks, because the encrypted tunnel prevents the ISP from inspecting the size of the files you transfer. This method may be useful for smaller data science teams and budding projects.
When data is still too large to upload by any of the other methods, there may be no other way than physically shipping the disks for copying at the destination. You copy the data to a storage device and ship it to the data center, where it is copied to cloud storage.
Copying the data to a physical drive and uploading it directly at the destination eliminates the public Internet as a middle layer, making the transfer to cloud storage faster. The effectiveness of this method depends on the type of drive: for example, Solid State Drives (SSDs) are faster than Hard Disk Drives (HDDs) because they have no moving parts.
When you need to upload large datasets to the cloud, you want to avoid delays and long transfer times. Available options include uploading the data directly over a provisioned or private connection, or compressing the files before uploading them.
When dealing with extremely large data, sometimes the only way is to copy it to a physical device and ship it to the new storage provider. Which technique is best? That depends on the size of your data and your cost requirements.
Gilad David Maayan is a technology writer who has worked with over 150 technology companies including SAP, Samsung NEXT, NetApp and Imperva, producing technical and thought leadership content that elucidates technical solutions for developers and IT leadership.