
Data Engineer Resume


New York, NY

PROFESSIONAL SUMMARY:

  • Self-motivated data engineer with solid foundational skills and a proven track record of implementations across various data platforms.
  • Experienced in writing subqueries, stored procedures, triggers, cursors, functions, and window functions in SQL.
  • Good understanding of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, Name Node, Data Node, and MapReduce concepts.
  • Experience working with Amazon Web Services (AWS), using EC2 for compute, Redshift Spectrum and AWS Glue for querying and ETL, and S3 for storage.
  • Strong working knowledge of the AWS cloud platform, including EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Created Spark Streaming modules to stream data into the data lake.
  • Strong experience writing scripts with the Python, PySpark, and Spark APIs to analyze data.
  • Experience in analyzing/manipulating huge and complex data sets and finding insightful patterns and trends within structured, semi-structured and unstructured data.
  • Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing capabilities.
  • Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, Pair RDDs, and Spark on YARN.
  • Solid understanding of dimensional modeling, star schema design, snowflake schema design, slowly changing dimensions, and conformed dimensions.
  • Proficient in Hive Query Language (HiveQL) and experienced in Hive performance optimization using static partitioning, dynamic partitioning, bucketing, and parallel execution (a brief sketch follows this list).
  • Designed and maintained high-performance ELT/ETL processes as a data engineer.
  • Experience analyzing data using HiveQL, Pig Latin, custom MapReduce programs in Java, and custom UDFs.
  • Good working experience with UNIX/Linux commands, scripting, and deploying applications on servers.
  • Strong skills in algorithms, data structures, object-oriented design, design patterns, documentation, and QA/testing.
  • Experienced working in fast-paced Agile teams, with exposure to testing in Scrum teams and test-driven development.
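
For illustration, a minimal PySpark sketch of the partitioning and bucketing approach referenced above. This is a hedged example, not code from any engagement listed below; the database, table, and column names are hypothetical, and it assumes a staging.sales table and an analytics database already exist.

```python
# Minimal PySpark sketch of partitioning and bucketing for query performance.
# All database, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("partitioning-bucketing-demo")
    .enableHiveSupport()          # needed to persist metastore-backed tables
    .getOrCreate()
)

sales = spark.table("staging.sales")  # hypothetical staging table

# Partition by load_date so date filters prune whole directories,
# and bucket by customer_id so joins on customer_id avoid a full shuffle.
(
    sales.write
    .mode("overwrite")
    .partitionBy("load_date")
    .bucketBy(32, "customer_id")
    .sortBy("customer_id")
    .saveAsTable("analytics.sales_bucketed")
)

# Downstream queries can then prune partitions, e.g.:
recent = spark.sql(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM analytics.sales_bucketed "
    "WHERE load_date = '2020-01-01' "
    "GROUP BY customer_id"
)
recent.show()
```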

TECHNICAL SKILLS:

Programming Languages: Python, Shell Scripting, VBA, Scala, Core Java

Hadoop/Big Data Stack: Hadoop, HDFS, MapReduce, Hive, Pig, Spark, PySpark, Scala, Kafka, Zookeeper, HBase, Sqoop, Flume, Oozie, Hue, Nifi.

Hadoop Distributions: Hortonworks, Cloudera

Cloud Technologies: AWS (S3, IAM, EC2, EMR, CloudWatch, DynamoDB, Redshift), Snowflake

ETL Tools: IBM DataStage, Monarch, ACL

Reporting Tools: Business Objects, Tableau

Query Languages: HiveQL, Impala, SQL, Pig, SnowSQL

Databases: IBM DB2, Netezza, Oracle, SQL Server, Teradata, Cassandra, Snowflake

Operating Systems: Windows, Linux

Version Control & Deployment Tools: Git, UCD, Ansible

PROFESSIONAL EXPERIENCE:

Confidential, New York, NY

Data Engineer

Responsibilities:

  • Maintained and developed complex SQL queries, views, functions, and reports that meet customer requirements on Snowflake.
  • Performed analysis, auditing, forecasting, programming, research, report generation, and software integration to gain an expert understanding of the current end-to-end BI platform architecture and support the deployed solution.
  • Created and refined test plans to ensure successful project delivery. Used performance analytics based on high-quality data to develop reports and dashboards with actionable insights.
  • Worked with the ETL team to document the transformation rules for Data migration from OLTP to Warehouse environment for reporting purposes.
  • Designed and implemented several types of sub-reports, drill-down reports, summary reports, parameterized reports, and ad-hoc reports using Tableau.
  • Built parameterized sales performance reports, refreshed them monthly, and distributed them to the respective departments/clients using Tableau.
  • Used Spark SQL to load JSON data, create schema RDDs, and load them into Hive tables, and handled structured data with Spark SQL (a brief sketch follows this list).
  • Worked on ingestion of applications/files from one Commercial VPC to OneLake.
  • Worked on building EC2 instances, creating IAM users and groups, and defining policies.
  • Worked on creating S3 buckets and applying bucket policies per client requirements.
  • Performed data wrangling to clean, transform and reshape the data utilizing pandas library. Analyzed data using SQL, Scala, Python, Apache Spark and presented analytical reports to management and technical teams.
  • Created high-level and low-level design documents per business requirements and worked with the offshore team to guide them on design and development.
  • Continuously monitored processes that took longer than expected to execute and tuned them.
  • Optimized existing pivot-table reports using Tableau and proposed an expanded set of views as interactive dashboards using line graphs, bar charts, heat maps, tree maps, trend analysis, Pareto charts, and bubble charts to enhance data analysis.
  • Monitored system life cycle deliverables and activities to ensure that procedures and methodologies were followed and that complete documentation was captured.
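
As a hedged illustration of the Spark SQL JSON-to-Hive loading pattern mentioned above: the S3 path, columns, and target table below are hypothetical, and the sketch assumes the S3A connector is configured.

```python
# Minimal PySpark sketch of loading JSON into a Hive table via Spark SQL.
# Path, column, and table names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("json-to-hive")
    .enableHiveSupport()          # needed to persist managed Hive tables
    .getOrCreate()
)

# Read semi-structured JSON; Spark infers the schema.
events = spark.read.json("s3a://example-bucket/raw/events/")

# Light cleanup with Spark SQL before persisting.
events.createOrReplaceTempView("events_raw")
cleaned = spark.sql("""
    SELECT event_id, user_id, CAST(event_ts AS TIMESTAMP) AS event_ts
    FROM events_raw
    WHERE event_id IS NOT NULL
""")

# Persist as a Hive table for downstream reporting and Tableau extracts.
cleaned.write.mode("overwrite").saveAsTable("analytics.events_cleaned")
```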

Environment: SQL, Snowflake, Python 3.x (Scikit-Learn/Keras/SciPy/NumPy/Pandas/Matplotlib/NLTK/Seaborn), Tableau (9.x/10.x), Hive, Databricks, Airflow, PostgreSQL, AWS, JIRA, GitHub.

Confidential

Data Engineer

Responsibilities:

  • Interacted with the development team, business team, data analysts, and data architects to understand project requirements, technical design, and source-to-target data mappings.
  • Analyzed requirements and source systems and prepared source-to-target mappings.
  • Worked extensively on the Spark Core and Spark SQL modules.
  • Developed data pipelines using Sqoop, HQL, Spark, and Kafka to ingest enterprise message-delivery data into HDFS.
  • Created complex HQL queries to load data from external to internal Hive tables.
  • Used Python and Shell scripting to build pipelines.
  • Worked on partitioning and bucketing Hive tables to improve query performance for business users.
  • Worked on importing real-time data into Hadoop using Kafka and Spark Streaming (a brief sketch follows this list).
  • Created aggregated Hive tables using HQL to drive Tableau reports and exported them into an RDBMS.
  • Involved in converting source SQL queries and Hive HQL into Spark Scala transformations.
  • Developed custom aggregate functions using Spark SQL and performed interactive querying on a POC level.
  • Worked on data serialization formats, converting complex objects into byte sequences using Avro, Parquet, JSON, and CSV formats.
  • Virtualized servers with Docker for test- and development-environment needs, and automated configuration using Docker containers.
  • Implemented batch processing using the Control-M tool.
  • Responsible for performing extensive data validation between Hive tables and RDBMS tables.
  • Responsible for leading the offshore team.
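
A minimal Structured Streaming sketch of the Kafka-to-HDFS ingestion pattern noted above. The broker addresses, topic name, and paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath (e.g., supplied via --packages at submit time).

```python
# Illustrative PySpark Structured Streaming sketch: Kafka topic -> HDFS Parquet.
# Brokers, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Subscribe to the Kafka topic carrying message-delivery events.
messages = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")  # hypothetical
    .option("subscribe", "message-delivery")                         # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka keys/values arrive as bytes; cast to strings before writing.
events = messages.select(
    col("key").cast("string").alias("key"),
    col("value").cast("string").alias("payload"),
    col("timestamp"),
)

# Micro-batch write to HDFS, with checkpointing for fault-tolerant file output.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/message_delivery")             # hypothetical path
    .option("checkpointLocation", "hdfs:///checkpoints/message_delivery")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```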

Environment: Hortonworks distribution, IBM DataStage, MapReduce, HDFS, Spark, Scala, Python, Hive, Pig, HBase, SQL, Sqoop, Flume, Oozie, Apache Kafka, Tez, Tableau

Confidential

Big Data Engineer

Responsibilities:

  • Interacted with end users to gather their needs, then designed and implemented the architecture with big data technologies such as Spark, Python, and Spark SQL.
  • Worked extensively with PySpark and Scala to combine data from numerous sources, including SQL Server, Teradata, FTP, and CSV files (see the sketch after this list).
  • Built a data ingestion pipeline to import data from a variety of sources and deliver processed data to end users such as data scientists.
  • Prepared data for Tableau developers to use in generating daily and weekly reports, analytics reports, and client-feedback reports.
  • Imported data into HDFS from relational databases using Sqoop.
  • Imported data from AWS S3 buckets into Spark RDDs and DataFrames.
  • Created pipelines with big data technologies for both one-time loads and daily incremental workloads.
  • Used big data technology, in line with business demands, to analyze website events such as the number of users who visited certain pages, the activities they carried out, and the food they purchased.
  • Wrote web-scraping code in Python, deployed against numerous websites, so that automated programs could log in and retrieve files.
  • Developed Spark scripts in Scala to load nested JSON files.
  • Wrote shell scripts to automate computing table statistics for Impala to improve query performance.
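
An illustrative PySpark sketch of combining an S3 CSV extract with a SQL Server table over JDBC, as described above. The bucket, JDBC URL, credentials, and table/column names are hypothetical, and the SQL Server JDBC driver is assumed to be on the classpath.

```python
# Minimal sketch: join a CSV landed on S3 with a relational reference table.
# All connection details and names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-source-join").getOrCreate()

# CSV extract landed on S3 (e.g., pushed there from an FTP drop).
feedback = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3a://example-bucket/landing/customer_feedback/")
)

# Reference data read over JDBC (requires the SQL Server JDBC driver).
customers = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=crm")  # hypothetical
    .option("dbtable", "dbo.customers")
    .option("user", "svc_reader")      # placeholder credentials
    .option("password", "***")
    .load()
)

# Join on a shared key (assumed present in both sources) and publish a curated set.
curated = feedback.join(customers, on="customer_id", how="left")
curated.write.mode("overwrite").parquet("s3a://example-bucket/curated/feedback_enriched/")
```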

Environment: MapReduce, HDFS, Hive, Spark SQL, Impala, PySpark, Scala, NiFi, Informatica, Teradata, Snowflake warehouse, Python, Airflow, Sqoop, UNIX scripting, Python scripting, MySQL

Confidential

ETL Developer

Responsibilities:

  • Developed procedures for extracting, cleaning, converting, integrating, and loading data into staging tables using the DataStage Designer.
  • Extensive usage of ETL to load data into the Informix database server from IBM DB2 databases, XML, and flat files as sources.
  • Used IBM WebSphere software to support analyzing, planning, designing, developing, and implementing projects.
  • Monitored DataStage jobs daily by executing UNIX shell scripts and force-started jobs whenever they failed to start normally.
  • Designed and implemented complex jobs using a variety of stages, including Lookup, Join, Transformer, Dataset, Row Generator, Column Generator, Sequential File, Aggregator, and Modify stages.
  • Built and modified batch scripts to transfer files from one server to another using the DataStage server.
  • Used the slowly changing dimension Type 2 technique extensively to maintain history in the database (an illustrative sketch follows this list).
  • Developed task sequencers to automate job runs.
  • Adapted an existing UNIX shell script to invoke the job sequencer from within the mainframe job.
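
For illustration only, a small pandas sketch of the slowly changing dimension Type 2 pattern (expire the current row, append a new version). The actual work used DataStage stages rather than Python, and the column names here are hypothetical.

```python
# Illustrative SCD Type 2 sketch in pandas; not the DataStage implementation.
# Assumed columns: customer_id, address (tracked attribute), start_date, end_date.
import pandas as pd

# Open-ended "current row" sentinel; warehouses often use 9999-12-31, which is
# outside pandas' Timestamp range, so the library maximum is used here instead.
HIGH_DATE = pd.Timestamp.max.normalize()

def apply_scd2(dim: pd.DataFrame, incoming: pd.DataFrame, load_date: pd.Timestamp) -> pd.DataFrame:
    """Expire changed rows and append new versions, keeping full history."""
    dim = dim.copy()
    current = dim[dim["end_date"] == HIGH_DATE]

    # Compare the tracked attribute between current and incoming rows.
    merged = current.merge(incoming, on="customer_id", suffixes=("_old", "_new"))
    changed_ids = merged.loc[merged["address_old"] != merged["address_new"], "customer_id"]

    # Expire the current version of every changed customer.
    dim.loc[
        dim["customer_id"].isin(changed_ids) & (dim["end_date"] == HIGH_DATE),
        "end_date",
    ] = load_date

    # Append a fresh version for changed and brand-new customers.
    new_ids = set(changed_ids) | (set(incoming["customer_id"]) - set(current["customer_id"]))
    new_rows = incoming[incoming["customer_id"].isin(new_ids)].copy()
    new_rows["start_date"] = load_date
    new_rows["end_date"] = HIGH_DATE
    return pd.concat([dim, new_rows], ignore_index=True)
```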

Environment: IBM Infosphere Information Server Datastage/Quality Stage 11.3, SQL Server, DB3, Unix Shell Script, MS Access, Oracle 11g, IBM DB2, Netezza
