
Data Engineer Resume


Charlotte, NC

SUMMARY

  • 8+ years of experience as a Data Engineer with analytical programming in Python; experience as an AWS Cloud Engineer (Administrator) working with AWS services (EC2, EMR, Lambda, Glue, S3, Athena, DynamoDB) and Microsoft Azure.
  • Worked in Linux server environments from DEV through PROD, alongside cloud-powered strategies built on Amazon Web Services (AWS).
  • Extensively worked on Spark with Python and Scala on clusters for computational analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle/Snowflake.
  • Hands-on experience developing Spark applications using RDD transformations, Spark Core, Spark MLlib, Spark Streaming, and Spark SQL.
  • Excellent working knowledge of Hadoop, Hive, Sqoop, Pig, HBase, and Oozie in real-time environments; worked on many modules for performance improvement and architecture design.
  • Expertise in deploying cloud-based services with Amazon Web Services (Databases, Migration, Compute, IAM, Storage, Analytics, Network & Content Delivery, Lambda, and Application Integration) and Microsoft Azure.
  • Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Python.
  • Excellent programming skills at a high level of abstraction using Scala and Python.
  • Strong experience and knowledge of real-time data analytics using Spark Streaming, Kafka, and Flume.
  • Experience developing SQL scripts with Spark to handle different data sets and verifying their performance against MapReduce jobs.
  • Excellent understanding of Enterprise Data Warehouse best practices; involved in full life-cycle development of data warehousing.
  • Experience with Hadoop security requirements. Good knowledge of Spark components such as Spark SQL, MLlib, Spark Streaming, and GraphX.
  • Analyzed and tuned performance of Spark jobs on EMR, matching instance types to the type and size of the input being processed.
  • Implemented Integration solutions for cloud platforms with Informatica Cloud.
  • Proficient in SQL, PL/SQL, and Python coding. Strong programming skills in Python and Scala to build efficient and robust data pipelines.
  • Working knowledge of Kafka for real-time data streaming and event-based architecture (see the streaming sketch at the end of this summary).
  • Experienced in ETL processes using a variety of technologies (Flume, Kafka, Sqoop) against data sources such as Azure Blob Storage, Azure Event Hubs, web log files, and RDBMS.
  • Good exposure to Development, Testing, Implementation, Documentation and Production support.
  • Develop effective working relationships with client teams to understand and support requirements, develop tactical and strategic plans to implement technology solutions, and effectively manage client expectations.
  • An excellent team member who can also work independently, with good interpersonal skills, persuasive communication, a strong work ethic, and a high level of motivation.
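
A minimal sketch of the real-time streaming pattern referenced above: consuming events from Kafka with Spark Structured Streaming in PySpark. It assumes Spark 3.x with the spark-sql-kafka package on the classpath; the broker address, topic name, and event schema are hypothetical placeholders, not taken from any specific project.

    # Sketch: windowed event counts over a Kafka stream (placeholder broker/topic/schema).
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_type", StringType()),
        StructField("event_time", TimestampType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
           .option("subscribe", "events")                      # placeholder topic
           .load())

    # Kafka values arrive as bytes; cast to string and parse the JSON payload.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", event_schema).alias("e"))
              .select("e.*"))

    # Count events per type in 5-minute windows, tolerating 10 minutes of late data.
    counts = (events
              .withWatermark("event_time", "10 minutes")
              .groupBy(F.window("event_time", "5 minutes"), "event_type")
              .count())

    query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()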

TECHNICAL SKILLS

Languages: Scala, SQL, UNIX shell script, JDBC, Python, Spark

Hadoop Ecosystem: HDFS, MapReduce, YARN, Hive, Pig, HBase, Kafka, Zookeeper, Sqoop, Oozie, DataStax & Apache Cassandra, Drill, Flume, Spark, NiFi

Cloud: AWS EC2, AWS S3, AWS EMR, Azure Databricks, Azure Blob Storage, Azure Virtual Machines.

Web Technologies: JDBC, HTML5, DHTML and XML, CSS3, Web Services, WSDL.

Databases: Snowflake (cloud), Teradata, IBM DB2, Oracle, SQL Server, MySQL, NoSQL, Cassandra.

Data Warehousing: Informatica PowerCenter/PowerMart/Data Quality/Big Data, Pentaho, ETL Development, Amazon Redshift, IDQ.

Version Control Tools: SVN, GitHub, Bitbucket.

BI Tools: Power BI, Tableau

Operating System: Windows, Linux, Unix, macOS.

PROFESSIONAL EXPERIENCE

Confidential - Charlotte NC

Data Engineer

Responsibilities:

  • Implemented Hadoop jobs on an EMR cluster, running several Spark, Hive, and MapReduce jobs to process data for recommendation engines, transactional fraud analytics, and behavioral insights.
  • Populated the data lake by leveraging Amazon S3, with service interactions made possible through Amazon Cognito.
  • Parsed data from S3 via Python API calls routed through Amazon API Gateway, generating batch sources for processing.
  • Good familiarity with AWS services such as DynamoDB, Redshift, Simple Storage Service (S3), and Amazon Elasticsearch Service.
  • Used Apache Airflow for complex workflow automation; process automation is handled by wrapper scripts written in shell.
  • Scheduled batch jobs through AWS Batch, performing data-processing jobs that leverage Apache Spark APIs via Scala.
  • Developed PySpark jobs with the Spark Core and Spark SQL libraries for processing data (see the PySpark sketch after this list).
  • Strong familiarity with PySpark RDDs, Spark actions and transformations, PySpark DataFrames, PySpark SQL, and Spark file formats.
  • Expert knowledge of Hadoop and Spark high-level architectures.
  • Hands-on experience implementing and deploying Elastic MapReduce (EMR) clusters on AWS with EC2 instances.
  • Developed Spark jobs in Python for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Strong knowledge of EMRFS, S3 bucketing, m3.xlarge and c3.4xlarge instance types, IAM roles, and CloudWatch logs.
  • Cluster deployment was managed through the YARN scheduler, with auto-scalable cluster sizing.
  • Performed a POC comparing the time taken for Change Data Capture (CDC) of Oracle data across Striim, StreamSets, and Dbvisit.
  • Responsible for creating multiple applications for reading data from different Oracle instances into NiFi.
  • Responsible for creating and maintaining DAGs using Apache Airflow (see the Airflow sketch after this list).
  • Expertise in using different file formats such as text, CSV, Parquet, and JSON.
  • Experience building custom compute functions with Spark SQL and performing interactive querying.
  • Experience working with Vagrant boxes to set up local NiFi and StreamSets pipelines.
  • Responsible for masking and encrypting sensitive data on the fly.
  • Responsible for setting up a MemSQL cluster on an Azure Virtual Machine instance.
  • Experience in real-time data streaming using PySpark with NiFi.
  • Responsible for creating a multi-node NiFi cluster.
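
A minimal sketch of the PySpark batch pattern referenced above: reading Parquet from S3, querying with Spark SQL, and writing the result back. The bucket, paths, and column names are hypothetical placeholders; on EMR the s3:// scheme is served by EMRFS, while elsewhere the hadoop-aws package and credentials would be needed.

    # Sketch: S3 Parquet in, Spark SQL aggregation, S3 Parquet out (placeholder paths/columns).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-batch-sketch").getOrCreate()

    orders = spark.read.parquet("s3://example-bucket/raw/orders/")
    orders.createOrReplaceTempView("orders")

    daily_totals = spark.sql("""
        SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders
        GROUP BY order_date
    """)

    daily_totals.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_totals/")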
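
A minimal sketch of the Airflow pattern referenced above: a DAG that chains shell-script wrappers around a spark-submit step. It assumes Airflow 2.x; the DAG id, schedule, script paths, and job file are hypothetical placeholders.

    # Sketch: daily extract -> Spark transform -> load, driven by shell wrappers (placeholder paths).
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    default_args = {
        "owner": "data-engineering",
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    }

    with DAG(
        dag_id="daily_batch_pipeline",
        default_args=default_args,
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:

        # Trailing spaces keep Airflow from treating the .sh paths as Jinja template files.
        extract = BashOperator(
            task_id="extract_source_data",
            bash_command="/opt/scripts/extract.sh ",
        )

        transform = BashOperator(
            task_id="run_spark_transform",
            bash_command="spark-submit /opt/jobs/transform.py ",
        )

        load = BashOperator(
            task_id="load_to_warehouse",
            bash_command="/opt/scripts/load.sh ",
        )

        extract >> transform >> load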

Confidential, Denver CO

Big Data Analyst

Responsibilities:

  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Designed and built the reporting application, which uses Spark SQL to fetch and generate reports on HBase table data.
  • Extracted the needed data from the server into HDFS and Bulk Loaded the cleaned data into HBase.
  • Used different file formats such as text files, SequenceFiles, Avro, RCFile (Record Columnar File), and ORC.
  • Strong experience implementing data warehouse solutions on Amazon Web Services (AWS) Redshift.
  • Worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
  • Involved in ETL, Data Integration and Migration
  • Responsible for creating Hive UDFs that helped spot market trends.
  • Optimized Hadoop MapReduce code and Hive/Pig scripts for better scalability, reliability, and performance.
  • Experience in storing the analyzed results back into the Cassandra cluster.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying (see the custom-aggregate sketch after this list).
  • Prepared ETL design document which consists of the database structure, change data capture, Error handling, restart, and refresh strategies.
  • Worked with different feed data formats such as JSON, CSV, XML, and DAT, and implemented the data lake concept.
  • Developed Informatica design mappings using various transformations.
  • Most of the infrastructure is on AWS.
  • Used Azure Databricks as the Hadoop distribution, Azure Blob Storage for raw file storage, and Azure Virtual Machines for Kafka.
  • Used Azure Functions to perform data validation, filtering, sorting, and other transformations for every data change in a DynamoDB table, and to load the transformed data into another data store.
  • Programmed ETL functions between Oracle and Amazon Redshift.
  • Maintained end-to-end ownership of data analysis, framework development, implementation, and communication for a range of customer analytics projects.
  • Good exposure to the IRI end-to-end analytics service engine and the new big data platform (Hadoop loader framework, big data Spark framework, etc.).
  • Used a Kafka producer to ingest raw data into Kafka topics and ran the Spark Streaming app to process clickstream events.
  • Performed data analysis and predictive data modeling.
  • Involved in ingestion, transformation, manipulation, and computation of data using StreamSets, Spark with Scala.
  • Involved in data ingestion into MemSQL using Flink pipelines for full and incremental loads from a variety of sources such as web servers, RDBMS, and data APIs.
  • Worked on Spark Data sources, Spark Data frames, Spark SQL and Streaming using Scala.
  • Experience developing Spark applications using Scala and SBT.
  • Experience integrating the Spark-MemSQL connector and JDBC connector to save data processed in Spark to MemSQL.
  • Used Flink Streaming on the pipelined Flink engine to process data streams and deploy new APIs, including flexible window definitions and large-scale Spark/Flink consumers.
  • Expertise in using different file formats such as text, CSV, Parquet, and JSON.
  • Experience building custom compute functions with Spark SQL and performing interactive querying.
  • Responsible for masking and encrypting sensitive data on the fly.
  • Responsible for creating and maintaining DAGs using Apache Airflow.
  • Responsible for setting up a MemSQL cluster on an Azure Virtual Machine instance.
  • Responsible for creating a multi-node Flink cluster.
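
A minimal sketch of the custom aggregate pattern referenced in the bullets above, written as a grouped-aggregate pandas UDF in PySpark (assumes Spark 3.x with pyarrow installed). The sample data and column names are hypothetical placeholders.

    # Sketch: custom 95th-percentile aggregate used through the Spark SQL DataFrame API.
    import pandas as pd
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("custom-agg-sketch").getOrCreate()

    # Grouped-aggregate pandas UDF: receives a group's values as a pandas Series,
    # returns one scalar per group.
    @pandas_udf("double")
    def p95(v: pd.Series) -> float:
        return float(v.quantile(0.95))

    clicks = spark.createDataFrame(
        [("home", 1.2), ("home", 3.4), ("cart", 2.0), ("cart", 9.9)],
        ["page", "dwell_seconds"],
    )

    clicks.groupBy("page").agg(p95(F.col("dwell_seconds")).alias("p95_dwell")).show()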

Confidential

Data Analyst

Responsibilities:

  • Understood business needs, analyzed functional specifications, and mapped them to the design and development of MapReduce programs and algorithms.
  • Extracted the needed data from the server into HDFS and Bulk Loaded the cleaned data into HBase.
  • Used different file formats such as text files, SequenceFiles, Avro, RCFile (Record Columnar File), and ORC.
  • Strong experience implementing data warehouse solutions on Amazon Web Services (AWS) Redshift.
  • Designed and implemented MapReduce-based large-scale parallel relation-learning system.
  • Customized Flume interceptors to encrypt and mask customer sensitive data as per requirement
  • Built recommendations using item-based collaborative filtering in Apache Spark.
  • Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data from various sources.
  • Built a web portal using JavaScript that makes REST API calls to Elasticsearch to retrieve the row key.
  • Used Kibana, an open-source browser-based analytics and search dashboard for Elasticsearch.
  • Used Amazon Web Services (AWS) such as EC2 and S3 for small data sets.
  • Imported data from various sources into the Cassandra cluster using Java APIs or Sqoop.
  • Developed iterative algorithms using Spark Streaming in Scala for near real-time dashboards.
  • Installed and configured Hadoop and Hadoop stack on a 40-node cluster.
  • Involved in customizing the MapReduce partitioner to route key-value pairs from mappers to reducers in XML format according to requirements.
  • Designed and built the reporting application, which uses Spark SQL to fetch and generate reports on HBase table data.
  • Worked on batch processing of data sources using Apache Spark and Elasticsearch.
  • Worked on various projects to migrate data from on-premises databases to AWS Redshift, RDS, and S3.
  • Involved in ETL, Data Integration and Migration
  • Responsible for creating Hive UDFs that helped spot market trends.
  • Optimized Hadoop MapReduce code and Hive/Pig scripts for better scalability, reliability, and performance.
  • Experience in storing the analyzed results back into the Cassandra cluster (see the Cassandra write sketch after this list).
  • Developed custom aggregate functions using Spark SQL and performed interactive querying.
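
A minimal sketch of writing analyzed results back to Cassandra from Spark, as mentioned in the bullets above. It assumes the DataStax spark-cassandra-connector package is on the classpath and that the keyspace and table already exist; the host, keyspace, table, and sample data are hypothetical placeholders.

    # Sketch: save a DataFrame of analyzed results to a Cassandra table (placeholder names).
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("cassandra-write-sketch")
             .config("spark.cassandra.connection.host", "cassandra-host")
             .getOrCreate())

    # In practice this DataFrame would come from the upstream analysis job.
    results = spark.createDataFrame(
        [("2023-01-01", "electronics", 1250.0), ("2023-01-01", "grocery", 310.5)],
        ["report_date", "category", "total_sales"],
    )

    (results.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="daily_sales")
     .mode("append")
     .save())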
