Big Data / Cloud Data Engineer Resume

Austin, TX

SUMMARY

  • 8.8 years of experience in the IT industry on big data platforms, with extensive hands-on experience in the Apache Hadoop ecosystem and enterprise application development.
  • Experience across the Hadoop ecosystem in ingestion, storage, querying, processing and analysis of big data.
  • Hands-on experience with data analytics services such as Athena, Glue Data Catalog and QuickSight
  • Performed the migration of Hive and MapReduce jobs from on-premises MapR to AWS cloud using EMR and Qubole
  • Experience in installing, configuring, supporting and managing Hadoop clusters using HDP and other distributions
  • Experience in analyzing data using HiveQL, custom UDFs and MapReduce programs in Python.
  • Good understanding of Hadoop architecture and its components such as HDFS, JobTracker, TaskTracker, NameNode, DataNode and MapReduce concepts.
  • Hands-on expertise with AWS databases such as RDS (Aurora), Redshift, DynamoDB and ElastiCache (Memcached & Redis)
  • Familiar with data architecture, including data ingestion pipeline design, Hadoop information architecture, data modeling and data mining, machine learning and advanced data processing. Experience in optimizing ETL workflows.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS and other services of the AWS family.
  • Scheduled Airflow DAGs to run multiple Hive and Pig jobs, which run independently based on time and data availability
  • Experience developing Airflow workflows for scheduling and orchestrating the ETL process
  • Experience developing a data pipeline using Kafka to store data in HDFS.
  • Strong work experience with Kafka streaming to fetch data in real time or near real time.
  • Hands-on experience with tools like Hive for data analysis, Sqoop for data ingestion, Oozie for scheduling and ZooKeeper for coordinating cluster resources
  • Created Spark Streaming modules to ingest data into the data lake.
  • Experience in dimensional data modeling (star schema, snowflake schema, fact and dimension tables) and with concepts like Lambda architecture, batch processing and Oozie.
  • Wrote the Kafka-Spark Streaming module, which acts as a Kafka consumer and executes the business logic on trades using Spark DStreams and RDD methods.
  • Worked on a Scala code base for Apache Spark, performing actions and transformations on RDDs, DataFrames and Datasets using Spark SQL and Spark Streaming contexts
  • Proficiency in analyzing large unstructured data sets using Spark and Scala and deploying on a YARN cluster
  • Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Good knowledge of extracting models and trends from raw data in collaboration with the data science team.
  • Good understanding of Apache Spark high-level architecture and performance tuning patterns
  • Parsed data from S3 using Python API calls through Amazon API Gateway, generating a batch source for processing
  • Good understanding of AWS SageMaker
  • Extracted, transformed and loaded data from different sources such as JSON files and databases, and exposed it for ad hoc/interactive queries using Spark SQL
  • Participated in the full software development lifecycle, including requirements, solution design, development, QA, implementation and product support, using Scrum and other Agile methodologies

TECHNICAL SKILLS

Big Data Ecosystem: HDFS, MapReduce, Hive, YARN, Impala, Sqoop, Flume, Oozie, Zookeeper, Spark, Scala, Storm, Kafka, Spark SQL, Azure SQL

Databases: Oracle, SQL Server, MySQL, HBase, MongoDB, Redshift, DynamoDB

Data Visualization Tools: Cognos, Tableau

AWS Tools: AWS SageMaker, AWS Glue, AWS Athena

Cloud: AWS, Azure

Programming Languages: Python, Scala, Shell scripting, PL/SQL

Operating System: Linux, Unix, Windows

Integration Tools: Git, Gerrit, Jenkins

PROFESSIONAL EXPERIENCE

Confidential, Austin, TX

Big Data / Cloud Data Engineer

Responsibilities:

  • Followed Agile methodologies and implemented them on various projects by setting up two-week sprints and daily stand-up meetings.
  • Involved in designing and deploying multi-tier applications using AWS services such as EC2, Route 53, S3, RDS, DynamoDB, SNS, SQS and IAM, focusing on high availability, fault tolerance and auto scaling with AWS CloudFormation
  • Developed Scala applications using DataFrames, SQL, Datasets and RDDs in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop
  • Supported continuous storage in AWS using Elastic Block Store (EBS), S3 and Glacier. Created volumes and configured snapshots for EC2 instances.
  • Used Kafka and Kafka brokers, initiated the Spark context and processed live streaming data with RDDs; also used Kafka to load data into HDFS and NoSQL databases.
  • Used the DataFrame API in Scala to work with distributed collections of data organized into named columns, developing predictive analytics using Apache Spark Scala APIs
  • Developed Hive queries to pre-process the data required for running the business process
  • Created HBase tables to load large sets of structured, semi-structured and unstructured data coming from UNIX, NoSQL and a variety of portfolios.
  • Integrated visualizations into a Spark application using Databricks and popular visualization libraries (ggplot, matplotlib).
  • Extracted data from HDFS using Hive and Presto, performed data analysis and feature selection using Spark with Scala and PySpark, and created nonparametric models in Spark
  • Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for analysis.
  • Involved in writing PySpark user-defined functions (UDFs) for various use cases and applied business logic wherever necessary in the ETL process.
  • Designed the data aggregations on Hive for ETL processing on Amazon EMR to process data as per business requirement
  • Developed Python scripts to automate the ETL process using Apache Airflow, as well as cron scripts on Unix.
  • Implemented Apache Airflow for authoring, scheduling and monitoring Data Pipelines
  • Implemented a generalized solution model using AWS SageMaker
  • Utilized the Spark SQL API in PySpark to extract and load data and perform SQL queries.
  • Worked on ETL migration services by developing and deploying AWS Lambda functions to build a serverless data pipeline whose output can be written to the Glue Data Catalog and queried from Athena.
  • Worked with Apache Airflow and Genie to automate jobs on EMR.
  • Programmed in Hive, Spark SQL and Python to streamline the incoming data, build and orchestrate the data pipelines, and extract useful insights
  • Involved in using the core Spark APIs and processing data on an EMR cluster
  • Used an ETL pipeline to source these tables and to deliver the calculated ratio data from AWS to the Datamart (SQL Server) and Credit Edge server
  • Involved in tuning relational databases (e.g. Microsoft SQL Server, Oracle, MySQL) and columnar databases (e.g. Amazon Redshift, Microsoft SQL Data Warehouse)

Environment: Hortonworks, Hadoop, HDFS, AWS Glue, AWS Athena, Python, EMR, Sqoop, Hive, NoSQL, HBase, Shell Scripting, Scala, Spark, SparkSQL, AWS, SQL Server, Tableau.

Confidential, Austin, TX

Big Data Engineer

Responsibilities:

  • Developed and implemented Software Release Management strategies for various applications according to the agile process.
  • Responsible for building scalable distributed data solutions in an EMR cluster environment with Amazon EMR 5.6.1.
  • Worked on the Kafka REST API to collect and load data into the Hadoop file system, and used Sqoop to load data from relational databases.
  • Extracted real-time feeds using Kafka and Spark Streaming, processed the data using DataFrames and saved it in Parquet format in HDFS.
  • Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka and persisted it into HDFS.
  • Developed Spark scripts by writing custom RDDs in Scala for data transformations and performing actions on RDDs.
  • Worked with Avro, Parquet and ORC file formats and compression techniques like LZO.
  • Used Hive to form an abstraction on top of structured data residing in HDFS and implemented partitions, dynamic partitions and buckets on Hive tables.
  • Installed and configured Airflow for S3 buckets and the Snowflake data warehouse, and created DAGs to run in Airflow.
  • Used Spark APIs over Hadoop YARN as the execution engine for data analytics.
  • Performed advanced procedures like text analytics and processing in Scala, using the in-memory computing capabilities of Spark.
  • Created Airflow scheduling scripts in Python.
  • Worked on migrating MapReduce programs into Spark transformations using Scala.
  • Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.
  • Involved in moving raw data between different systems using Apache NiFi.
  • Wrote MapReduce code in Python to get rid of certain security issues in the data
  • Created RDDs and DataFrames for the required input data and performed data transformations using PySpark.
  • Used the Apache Oozie job scheduler to execute the workflow.
  • Designed AWS CloudFormation templates to create VPCs, subnets and NAT to ensure successful deployment of web applications and database templates.
  • Created S3 buckets, managed S3 bucket policies, and used S3 and Glacier for storage and backup on AWS.
  • Created data pipelines for different events to load the data from DynamoDB to AWS S3 bucket and then into HDFS location
  • Used Ambari to monitor node health and job status in Hadoop clusters.
  • Worked on Tableau to build customized interactive reports, worksheets and dashboards.
  • Implemented Kerberos for strong authentication to provide data security.
  • Implemented LDAP and Active Directory for Hadoop clusters
  • Worked on Apache Solr for indexing and load-balanced querying to search for specific data in larger datasets.
  • Involved in performance tuning of Spark jobs using caching and taking full advantage of the cluster environment.

Environment: AWS S3, EMR, Lambda, Python, CloudWatch, Amazon Redshift, Spark-Java, Spark-Scala, Athena, Hive, HDFS, Spark, Scala, Oozie, Bitbucket, GitHub.

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Developed Talend Big Data jobs to load heavy volumes of data into an S3 data lake and then into Snowflake.
  • Developed Snowpipes for continuous ingestion of data using an event handler from AWS (S3 bucket).
  • Developed SnowSQL scripts to deploy new objects and update changes in Snowflake.
  • Developed a Python script to integrate DDL changes between the on-prem Talend warehouse and Snowflake.
  • Worked with the AWS stack: S3, EC2, Snowball, EMR, Athena, Glue, Redshift, DynamoDB, RDS, Aurora, IAM, Firehose, and Lambda.
  • Designed and implemented new Hive tables, views and schemas, and stored data optimally.
  • Ran Sqoop jobs to land data on HDFS and ran validations.
  • Optimized queries to improve query performance.
  • Designing and creating SQL Server tables, views, stored procedures, and functions.
  • Performed ETL operations using Apache Spark, ran ad hoc queries and implemented machine learning techniques.
  • Worked on configuring CI/CD for CaaS deployments (Kubernetes).
  • Involved in migrating master data from Hadoop to AWS.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames and pair RDDs.
  • Developed a preprocessing job using Spark DataFrames to transform JSON documents into flat files
  • Loaded DStream data into Spark RDDs and performed in-memory computation to generate the output response
  • Processed big data with Amazon EMR across a Hadoop cluster of virtual servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
  • Imported data from AWS S3 into Spark RDDs and performed transformations and actions on the RDDs.
  • Worked on Big Data infrastructure for batch processing and real-time processing using Apache Spark
  • Developed Apache Spark applications by using Scala for data processing from various streaming sources
  • Processed web server logs by developing multi-hop Flume agents using the Avro sink and loaded them into Cassandra for further analysis; extracted files from Cassandra through Flume
  • Responsible for design and development of Spark SQL Scripts based on Functional Specifications
  • Worked on the large-scale Hadoop YARN cluster for distributed data processing and analysis using Spark, Hive, and Cassandra
  • Involved in converting Cassandra/Hive/SQL queries into Spark transformations using RDDs and Scala
  • Implemented Spark scripts using Scala and Spark SQL to access Hive tables from Spark for faster processing of data.
  • Developed a helper class that abstracts the Cassandra cluster connection and acts as a core toolkit
  • Involved in creating a data lake by extracting customer data from various data sources into HDFS, including data from Excel, databases, and server logs
  • Moved data from HDFS to Cassandra using MapReduce and the BulkOutputFormat class.
  • Extracted files from Cassandra through Sqoop, placed them in HDFS and processed them using Hive
  • Wrote MapReduce (Hadoop) programs to convert text files into Avro and load them into Hive tables
  • Wrote real-time processing and core jobs using Spark Streaming with Kafka as the data pipeline system
  • Extended Hive and Pig core functionality with custom user-defined functions (UDFs), user-defined table-generating functions (UDTFs) and user-defined aggregating functions (UDAFs)
  • Involved in loading data from REST endpoints to Kafka producers and transferring the data to Kafka brokers
  • Used Apache Kafka features such as distribution, partitioning and the replicated commit log service for messaging
  • Partitioned data streams using Kafka. Designed and configured a Kafka cluster to accommodate heavy throughput.
  • Exported the analyzed data to the relational databases using Sqoop for visualization and to generate reports for the BI team
  • Migrated an existing on-premises application to Amazon Web Services (AWS) and used services like EC2 and S3 for processing and storage of small data sets; experienced in maintaining the Hadoop cluster on AWS EMR
  • Developed solutions to pre-process large sets of structured and semi-structured data in different file formats like Text, Avro, Sequence, XML, JSON, and Parquet
  • Generated various kinds of reports using Pentaho and Tableau based on client specifications
  • Gained exposure to new tools like Jenkins, Chef and RabbitMQ.
  • Worked with the Scrum team to deliver agreed user stories on time for every sprint

Environment: Snowflake, SnowSQL, Hadoop, MapReduce, HDFS, YARN, Hive, Sqoop, Spark, Scala, AWS, EC2, S3, EMR, Cassandra, Flume, Kafka, Pig, Linux, Shell Scripting.

Confidential, Dallas, TX

Data Engineer

Responsibilities:

  • Worked on the Snowflake Shared Technology Environment to provide stable infrastructure, a secure environment, reusable generic frameworks, robust design architecture, technology expertise, best practices and automated SCBD (Secured Database Connections, Code Review, Build Process, Deployment Process) utilities.
  • Designed ETL processes using the Pentaho tool to load data from sources to targets with transformations.
  • Worked on Snowflake Schemas and Data Warehousing.
  • Developed Pentaho Big Data jobs to load heavy volumes of data into an S3 data lake and then into the Redshift data warehouse.
  • Migrated the data from the Redshift data warehouse to the Snowflake database.
  • Built dimensional models and a data vault architecture on Snowflake.
  • Built a scalable distributed Hadoop cluster running Hortonworks Data Platform (HDP 2.6)
  • Involved in developing Spark code using Scala and Spark SQL for faster testing and processing of data, and explored optimizations using SparkContext, Spark SQL and pair RDDs
  • Serialized JSON data and stored it in tables using Spark SQL
  • Used Spark Streaming to collect data from Kafka in near real time, perform the necessary transformations and aggregations to build the common learner data model, and store the data in a NoSQL store (HBase).
  • Worked with the Spark framework for both batch and real-time data processing
  • Hands-on experience with Spark MLlib for predictive intelligence, customer segmentation and smooth maintenance in Spark Streaming
  • Developed Spark Streaming programs that take data from Kafka and push it to different destinations
  • Loaded data from different sources (Teradata, DB2, Oracle and flat files) into HDFS using Sqoop and then into partitioned Hive tables.
  • Created various Pig scripts and converted them into shell commands to provide aliases for common operations in the project business flow.
  • Implemented partitioning and bucketing in Hive for better organization of the data.
  • Created a few Hive UDFs to hide or abstract complex repetitive rules.
  • Developed Oozie workflows for daily incremental loads, which get data from Teradata and import it into Hive tables.
  • Developed bash scripts to fetch log files from an FTP server and process them for loading into Hive tables.
  • Scheduled all the bash scripts using the Resource Manager scheduler.
  • Developed MapReduce programs for applying business rules to the data.
  • Developed a NiFi workflow to pick up data from the data lake as well as from a server and send it to a Kafka broker
  • Involved in loading and transforming large sets of structured data from router location to EDW using an Apache NiFi data pipeline flow
  • Implemented a Kafka event log producer to publish logs to a Kafka topic, which are consumed by the ELK (Elasticsearch, Logstash, Kibana) stack to analyze the logs produced by the Hadoop cluster
  • Implemented Apache Kafka as a replacement for a more traditional message broker (JMS Solace) to reduce licensing costs, decouple processing from data producers and buffer unprocessed messages.
  • Implemented the receiver-based approach in Spark Streaming, linking with the StreamingContext using Python and handling the proper closing and waiting stages.
  • Implemented rack topology scripts for the Hadoop cluster.
  • Implemented a fix to resolve issues related to the old Hazelcast EntryProcessor API.
  • Used the Akka toolkit with Scala to perform a few builds
  • Excellent knowledge of the Talend Administration Console, Talend installation, and the use of context and globalMap variables in Talend
  • Used dashboard tools like Tableau
  • Used the Talend Administration Console Job Conductor to schedule ETL jobs on a daily and weekly basis

Environment: Hadoop HDP, Linux, MapReduce, HBase, HDFS, Hive, Pig, Tableau, NoSQL, Shell Scripting, Sqoop, open source technologies (Apache Kafka, Apache Spark), Git, Talend.

Confidential

SQL DEVELOPER / DATA ENGINEER

Responsibilities:

  • Worked on different data flow and control flow tasks: For Loop container, Sequence container, Script task, Execute SQL task and package configuration.
  • Created new procedures to handle complex logic for business and modified already existing stored procedures, functions, views and tables for new enhancements of the project and to resolve the existing defects.
  • Loaded data from various sources like OLE DB and flat files into a SQL Server 2012 database using SSIS packages and created data mappings to load the data from source to destination.
  • Created batch jobs and configuration files to create automated process using SSIS.
  • Created SSIS packages to pull data from SQL Server and exported to Excel Spreadsheets and vice versa.
  • Built SSIS packages to fetch files from remote locations like FTP and SFTP, decrypt and transform them, load them into the data warehouse and provide proper error handling and alerting
  • Made extensive use of expressions, variables and Row Count in SSIS packages
  • Performed data validation and cleansing of staged input records before loading into the data warehouse
  • Automated the extraction of files such as flat and Excel files from various sources like FTP and SFTP (Secure FTP).
  • Deployed and scheduled reports using SSRS to generate daily, weekly, monthly and quarterly reports.

Environment: MS SQL Server 2008/2012, SQL Server Business Intelligence Development Studio, SSIS-2008, SSRS-2008, Report Builder, Office, Excel, Flat Files, .NET, T-SQL.

Confidential

SQL Server Developer

Responsibilities:

  • Involved in installing and configuring SQL Server 2005
  • Created database objects such as tables and indexes, and designed constraints for data integrity.
  • Created complex stored procedures, functions, triggers, and other database objects for applications using T-SQL
  • Added additional functionality to the specified business logic as per customer demand.
  • Created linked servers to connect various OLE DB servers and providers.
  • Extensively involved in designing SSIS packages to load data from flat file sources into SQL Server databases.
  • Automated SSIS Packages by creating jobs and also created alerts.
  • Identified and defined the report datasets.
  • Enhanced and deployed SSIS packages from the development server to the production server.
  • Wrote stored procedures for report generation in SQL Server 2005.
  • Created sub-reports, drill-down reports, summary reports and parameterized reports in SSRS

Environment: SQL Server 2000/2005, SSIS, SSRS, Crystal Reports, Microsoft Excel
