
Big Data Engineer Resume

Knoxville, TN


  • IT experience in analysis, design, and development of Big Data solutions in Scala, Spark, Hadoop, Pig, and HDFS environments, plus experience in Python.
  • Excellent technical and analytical skills with a clear understanding of design goals and development for OLTP and dimensional modeling for OLAP.
  • Strong experience in building fully automated Continuous Integration and Continuous Delivery pipelines and DevOps processes for agile, store-based applications in the Retail and Transportation domains.
  • Firm understanding of Hadoop architecture and its components, including HDFS, JobTracker, TaskTracker, NameNode, DataNode, and MapReduce programming.
  • Have experience in data analytics, designing reports with visualization solutions using Tableau desktop and publishing on to the Tableau server.
  • Good knowledge of Amazon Web Services (AWS) concepts like EC2, S3, EMR, ElastiCache, DynamoDB, Redshift, and Aurora.
  • Experience in developing Python and shell scripts to extract, load, and transform data, with working knowledge of AWS Redshift.
  • Involved in Software development, Data warehousing and Analytics and Data engineering projects using Hadoop, MapReduce, Pig, Hive and other open source tools/technologies.
  • Extensively used SQL, NumPy, Pandas, SparkML, and Hive for data analysis and model building.
  • Experience in analyzing, designing, and developing ETL Strategies and processes, writing ETL specifications.
  • Excellent understanding of NoSQL databases like HBase, Cassandra, and MongoDB.
  • Proficient knowledge of and hands-on experience in writing shell scripts in Linux.
  • Expertise in Hadoop ecosystem tools including HDFS, YARN, MapReduce, Pig, Hive, Sqoop, Flume, Spark, Zookeeper, and Oozie.
  • Developed multiple POCs using PySpark and Scala, deployed them on the YARN cluster, and compared the performance of Spark with Hive and SQL.
  • Experienced in requirement analysis, application development, application migration and maintenance using Software Development Lifecycle (SDLC) and Python/Java technologies.
  • Adequate knowledge and working experience in Agile and Waterfall Methodologies.
  • Defining user stories and driving the agile board in JIRA during project execution, participating in sprint demo and retrospective.
  • Good interpersonal, communication, and problem-solving skills; adapts to new technologies with ease and works well as a team member.


Hadoop Ecosystem: HDFS, SQL, YARN, Pig Latin, MapReduce, Hive, Sqoop, Spark, Storm, Zookeeper, Oozie, Kafka, Flume

Programming Languages: Python, PySpark, JavaScript, Shell Scripting

Big Data Platforms: Hortonworks, Cloudera

AWS Platform: EC2, S3, EMR, Redshift, DynamoDB, Aurora, VPS, Glue, Kinesis, Boto3

Operating Systems: Linux, Windows, UNIX

Databases: Netezza, MySQL, UDB, HBase, MongoDB, Cassandra, Snowflake

Development Methods: Agile/Scrum, Waterfall

IDEs: PyCharm, IntelliJ, Ambari

Data Visualization: Tableau, BO Reports, Splunk


Big Data Engineer

Confidential | Knoxville, TN


  • Designed, developed, and maintained non-production and production transformations in the AWS environment and created data pipelines using PySpark.
  • Analyzed the SQL scripts to design and develop the solution to implement in PySpark.
  • Migrated data from the on-prem Netezza data warehouse to AWS S3 buckets using data pipelines written in PySpark.
  • Set up Spark configuration and managed Spark jobs on an Amazon EMR cluster.
  • Used Jupyter Notebook and Spark-Shell to develop, test and analyze Spark jobs before scheduling the customized Active Batch Jobs.
  • Performed data analysis and data quality check using Apache Spark Machine learning libraries in Python.
  • Implemented Amazon Redshift, Spectrum and Glue for the migration of the Fact and Dimensions tables to the Production environment.
  • Developed and tested SQL code in Spectrum, measuring the execution time of each transform and the cost of the data scanned per transform.
  • Wrote Python programs to create external tables in Glue for the corresponding tables stored in S3 buckets, for use in Amazon Redshift Spectrum.
  • Created Active Batch jobs to automate the PySpark and SQL functions as daily jobs.
  • Worked on developing SQL DDL to create, drop and alter the tables.
  • Maintained project code in GitHub repositories and used Git for software development and version control.
  • Documented all changes implemented in the Redshift SQL code, including technical changes and data schema type changes, using Confluence and Atlassian Jira.

Environment: Python, HDFS, PySpark, Yarn, Pandas, NumPy, SparkML, AWS S3, EMR, AWS Redshift, Spectrum, Glue, Netezza, Active Batch.

Data Engineer

Confidential | Dallas, TX


  • Designed robust, reusable, and scalable data-driven solutions and data pipeline frameworks in Python to automate the ingestion, processing, and delivery of structured and unstructured batch and real-time streaming data.
  • Built data warehouse structures, creating fact, dimension, and aggregate tables through dimensional modeling with Star and Snowflake schemas.
  • Applied transformations to data loaded into Spark DataFrames and performed in-memory computation to generate the output response.
  • Good knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Used the Spark DataFrame API on Cloudera platforms to perform analytics on Hive data and used DataFrame operations to perform the required data validations.
  • Hands-on experience developing UDFs, DataFrames, and SQL queries in Spark SQL.
  • Created and modified data ingestion pipelines using Kafka and Sqoop to ingest database tables and streaming data into HDFS for analysis.
  • Built real-time streaming data pipelines with Kafka, Spark Streaming, and Cassandra.
  • Finalized the naming standards for data elements and ETL jobs and created a data dictionary for metadata management.
  • Developed ETL workflows in Python on data ingested with Flume for processing in HDFS and HBase.
  • Hands-on experience with Continuous Integration and Deployment (CI/CD) using Jenkins and Docker.
  • Queried multiple databases, including Snowflake, Netezza, UDB, and MySQL, for data processing.
  • Developed ETL pipelines into and out of the data warehouse using a combination of Python and Snowflake SnowSQL, and wrote SQL queries against Snowflake.
  • Good experience with analysis tools such as Tableau and Splunk for regression analysis, pie charts, and bar graphs.

Environment: Python, HDFS, Spark, Kafka, Hive, Yarn, Cassandra, HBase, Jenkins, Docker, Tableau, Splunk, BO Reports, Netezza, UDB, MySQL, Snowflake, IBM DataStage.

Big Data Engineer/Hadoop Developer

Confidential | Austin, TX


  • Responsible for the design, implementation and architecture of very large-scale data intelligence solutions around big data platforms.
  • Analyzed large and critical datasets using HDFS, HBase, Hive, HQL, Pig, Sqoop and Zookeeper.
  • Developed multiple POCs using Spark and Scala, deployed them on the Yarn cluster, and compared the performance of Spark with Hive and SQL.
  • Used Amazon Elastic Compute Cloud (EC2) infrastructure for computational tasks and Simple Storage Service (S3) as the storage mechanism.
  • Used AWS utilities such as EMR, S3, and CloudWatch to run and monitor Hadoop and Spark jobs on AWS.
  • Maintained AWS Data Pipeline as a web service to process and move data between Amazon S3, Amazon EMR, and Amazon RDS resources.
  • Worked on SQL queries in dimensional data warehouses and relational data warehouses. Performed Data Analysis and Data Profiling using Complex SQL queries on various systems.
  • Troubleshot and resolved data processing issues and proactively engaged in data modelling discussions.
  • Worked with the RDD architecture, implementing Spark operations on RDDs and optimizing transformations and actions in Spark.
  • Wrote Spark programs using Python, PySpark, and Pandas for performance tuning, optimization, and data quality validation.
  • Developed Kafka producers and consumers to stream millions of events per second.
  • Implemented a distributed messaging queue with Apache Kafka to integrate with Cassandra.
  • Hands-on experience fetching live stream data from UDB into HBase tables using PySpark Streaming and Apache Kafka.
  • Used Tableau to build customized interactive reports, worksheets, and dashboards.

Environment: HDFS, Python, SQL, Web Services, MapReduce, Spark, Kafka, Hive, Yarn, Pig, Flume, Zookeeper, Sqoop, UDB, Tableau, AWS, GitHub, Shell Scripting.

Big Data Engineer

Confidential | Austin, TX


  • Responsible for building scalable distributed data solutions in a Hadoop cluster environment with the Hortonworks distribution.
  • Converted raw data to serialized formats such as Avro and Parquet to reduce data processing time and increase the efficiency of data transfer over the network.
  • Worked on building end to end data pipelines on Hadoop Data Platforms.
  • Worked on Normalization and De-normalization techniques for optimum performance in relational and dimensional databases environments.
  • Designed, developed, and tested Extract, Transform, Load (ETL) applications with different types of sources.
  • Created files and tuned SQL queries in Hive using Hue. Implemented MapReduce jobs in Hive by querying the available data.
  • Explored Spark to improve the performance and optimization of existing Hadoop algorithms using SparkContext, Spark SQL, DataFrames, and pair RDDs.
  • Used PySpark to access Spark libraries from Python scripts for data analysis.
  • Involved in converting HiveQL into Spark transformations using Spark RDDs in Scala.
  • Created User Defined Functions (UDFs) and User Defined Aggregate Functions (UDAFs) in Pig and Hive.
  • Worked on building custom ETL workflows using Spark/Hive to perform data cleaning and mapping.
  • Implemented custom Kafka encoders for a custom input format to load data into Kafka partitions.
  • Supported the cluster and its topics through Kafka Manager. Handled CloudFormation scripting, security, and resource automation.

Environment: Python, HDFS, MapReduce, Flume, Kafka, Zookeeper, Pig, Hive, HQL, HBase, Spark, ETL, Web Services, Linux RedHat, Unix.

Hadoop Engineer/Developer



  • Designed and developed applications on the data lake to transform data according to business users' analytics needs.
  • Responsible for managing data coming from different sources and involved in HDFS maintenance and loading of structured and unstructured data.
  • Worked with file formats such as CSV, TXT, and fixed-width to load data from various sources into raw tables.
  • Conducted data model reviews with team members and captured technical metadata through modelling tools.
  • Implemented ETL processes; wrote and optimized SQL queries to extract and merge data from a SQL Server database.
  • Experience in loading logs from multiple sources into HDFS using Flume.
  • Worked with NoSQL databases like HBase in creating HBase tables to store large sets of semi-structured data coming from various data sources.
  • Involved in designing and developing tables in HBase and storing aggregated data from Hive tables.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.
  • Data cleaning, pre-processing and modelling using Spark and Python.
  • Strong Experience in writing SQL queries.
  • Responsible for triggering the jobs using the Control-M.

Environment: Python, SQL, ETL, Hadoop, HDFS, Spark, Scala, Kafka, HBase, MySQL, Netezza, Web Services, Shell Script, Control-M.

Software Engineer



  • Involved in preparing high-level design documents, coding, and business analysis while strengthening programming skills.
  • Developed Python automation scripts to facilitate quality testing.
  • Wrote Python modules to extract/load asset data from the MySQL source database.
  • Experience in using PL/SQL to write stored procedures, functions and triggers.
  • Worked with the backend team to design, build, and implement RESTful APIs for various services.
  • Analyzed business process workflows and assisted in the development of ETL procedures for mapping data from source to target systems.
  • Moved and copied databases, including detaching and attaching them, and backed up and restored databases.
  • Involved in resolving ETL production issues, performed recovery steps, and implemented bug fixes.
  • Tuned the performance of complex SQL queries and scheduled BI jobs according to the design flow using Control-M.
  • Worked with Data Engineers to submit SQL statements, import and export data, and generate reports in SQL Server.

Environment: Python, ETL, MySQL, SOAP, SQL, Netezza, WebSphere, Web Services, Shell Script, Control-M.
