
Sr. Data Engineer Resume


SUMMARY

  • 9+ years of strong experience in Data Analytics, Data Mining with large sets of structured and unstructured data, Data Acquisition, Data Validation, Predictive Modeling, Statistical Modeling, Data Modeling, and Data Visualization. Adept in statistical programming languages such as R, Python, and SAS, and in Apache Spark and Big Data technologies including Hadoop, Hive, Sqoop, and Pig.
  • 3+ years of experience in Hadoop 2.0. Led development of enterprise-level solutions utilizing Hadoop utilities such as Spark, MapReduce, Sqoop, Pig, Hive, HBase, ZooKeeper, Phoenix, Oozie, Flume, streaming jars, and custom SerDes. Worked on proofs of concept with Kafka and Storm.
  • Deep understanding of Big Data, Natural Language Processing (NLP), and Machine Learning algorithms using Hadoop, MapReduce, NoSQL, and distributed computing tools.
  • Expertise in writing Spark RDD transformations, actions, DataFrames, and case classes for the required input data, and in performing data transformations using Spark Core (a minimal PySpark sketch follows this list).
  • Extensively worked on Spark with Python on clusters for analytics; installed Spark on top of Hadoop and built advanced analytical applications using Spark with Hive and SQL/Oracle.
  • Experience in developing enterprise-level solutions using batch processing (Apache Pig) and streaming frameworks (Spark Streaming, Apache Kafka, and Apache Flink).
  • Expertise in data load management, importing and exporting data using Sqoop, Flume, and the Kafka messaging system.
  • Sound knowledge of Netezza SQL.
  • Involved in the design and development of multiple Power BI dashboards and reports, and in managing data privacy and security in Power BI.
  • Experienced as a lead managing the entire data science project life cycle, actively involved in all phases including data acquisition, data cleaning, data engineering, feature scaling, feature engineering, statistical modeling (decision trees, regression models, clustering), and dimensionality reduction using Principal Component Analysis.
  • Expertise in synthesizing Machine learning, Predictive Analytics and Big data technologies into integrated solutions.
  • Comfortable with R, Python, SAS, and relational databases. Extensively worked on NumPy, Pandas, scikit-learn, Seaborn, SciPy, Spark, and Hive for data analysis and machine learning model building; familiar with NLTK tools for text analysis and deep learning.
  • Extensively used Ab Initio components like Reformat, Join, Fuse, Partition by Key, Partition by Expression, Merge, Gather, Sort, Dedup Sort, Rollup, Scan, and Lookup. Used Ab Initio features like MFS, checkpoints, phases, etc.
  • Experienced in dimensional and relational data modeling using ER/Studio, Erwin, and Sybase PowerDesigner; star join schema/snowflake modeling; fact and dimension tables; and conceptual, logical, and physical data modeling.
  • Experienced in CI/CD processes.
  • Procedural knowledge in cleansing and analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
  • Strong experience in design and development of Business Intelligence solutions using Tableau, R Shiny, Python Flask, data modeling, dimensional modeling, ETL processes, data integration, OLAP, and client/server applications.
  • Experienced in writing custom UDFs and UDAFs for extending Hive and Pig core functionalities.
  • Hands-on experience in formatting and performing ETL on raw data in various formats such as Avro, ORC, Parquet, CSV, and JSON. Experience with Elasticsearch and MDM solutions.
  • Strong experience and knowledge in data visualization with Tableau, creating line and scatter plots, bar charts, histograms, pie charts, dot charts, box plots, time series, error bars, multiple chart types, multiple axes, subplots, etc.
  • Experienced with Integration Services (SSIS), Reporting Services (SSRS), and Analysis Services (SSAS).
  • Expertise in normalization to 3NF and de-normalization techniques for optimum performance in relational and dimensional database environments.
  • Well-versed in version control and CI/CD tools such as Git, SourceTree, and Bitbucket, as well as GCP and Amazon Web Services (AWS) products S3, EC2, EMR, and RDS.
  • Experience in all stages of the SDLC (Agile, Waterfall): writing technical design documents, development, testing, and implementation of enterprise-level data marts and data warehouses.
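The following is a minimal, illustrative PySpark sketch of the kind of DataFrame and RDD transformations described above; it is not project code, and the table and column names (raw_events, event_type, event_ts, amount) are hypothetical placeholders.

    # Minimal PySpark sketch: DataFrame and RDD transformations (illustrative only).
    # Table/column names below are hypothetical placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("summary-transform-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Read a Hive table into a DataFrame and apply filter/group/aggregate transformations.
    events = spark.table("raw_events")
    daily_totals = (events
                    .filter(F.col("event_type").isNotNull())
                    .groupBy("event_type", F.to_date("event_ts").alias("event_date"))
                    .agg(F.sum("amount").alias("total_amount")))

    # Equivalent RDD-style transformation plus an action.
    counts = (events.rdd
              .map(lambda row: (row["event_type"], 1))
              .reduceByKey(lambda a, b: a + b))
    print(counts.take(10))

    # Persist the result back to Hive (target database/table are placeholders).
    daily_totals.write.mode("overwrite").saveAsTable("analytics.daily_totals")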

TECHNICAL SKILLS

OLAP/Reporting Tools: SSRS, SSAS, MDX, Tableau, Power BI

Relational Databases: SQL Server 2014/2012/2008 R2/2005, Oracle 11g, SQL Server Azure, MS Access, PostgreSQL

SQL Server Tools: Microsoft Visual Studio 2010/2013/2015, SQL Server Management Studio

Big Data Ecosystem: HDFS, NiFi, MapReduce, Oozie, Hive/Impala, Pig, Sqoop, ZooKeeper, HBase, Spark, Scala, Kafka, Apache Flink, AWS (EC2, S3, EMR)

Other Tools: MS Office 2003/2007/2010/2013, Power Pivot, PowerBuilder, Git, CI/CD, Jupyter Notebook

Programming Languages: C, SQL, PL/SQL, T-SQL, Java, batch scripting, R, Python

Data Warehousing & BI: Star Schema, Snowflake schema, Facts and Dimensions tables, SAS, SSIS, and Splunk

Operating Systems: Windows XP/Vista/7/8 and 10; Windows 2003/2008R2/2012 Servers

PROFESSIONAL EXPERIENCE

Confidential

Sr. Data Engineer

Responsibilities:

  • Involved in requirements gathering, analysis, design, development, change management, deployment.
  • Experienced in the design and deployment of Hadoop clusters and various Big Data components including HDFS, MapReduce, Hive, Sqoop, Pig, Oozie, and ZooKeeper in the Cloudera distribution.
  • Utilized Apache Spark with Python to develop and execute Big Data analytics and machine learning applications; executed machine learning use cases under Spark ML and MLlib.
  • Migrated existing on-premises data to AWS S3. Used AWS services such as EC2 and S3 for data set processing and storage.
  • Experienced in maintaining the Hadoop cluster on Hortonworks on GCP.
  • Led the team in developing real-time streaming applications using PySpark, Apache Flink, Kafka, and Hive on a distributed Hadoop cluster.
  • Experienced in writing live real-time processing and core jobs using Spark Streaming with Kafka as a data pipeline system (see the streaming sketch after this list).
  • Conducted ETL development in the Netezza environment using standard design methodologies.
  • Partitioned data streams using Kafka; designed and configured a Kafka cluster to accommodate a heavy throughput of 1 million messages per second. Used the Kafka producer API to produce messages (a hedged producer sketch also follows this list).
  • Involved in loading and transforming large datasets from relational databases into HDFS and vice versa using Sqoop imports and exports.
  • Responsible for loading data pipelines from web servers and Teradata using Sqoop with Kafka and the Spark Streaming API.
  • Designed and developed applications using Apache Spark, Scala, Python, Redshift, NiFi, S3, and AWS EMR on the AWS cloud to format, cleanse, validate, create schemas, and build data stores on S3.
  • Extracted data from heterogeneous sources and performed complex business logic on network data to normalize raw data so that it can be utilized by BI teams to detect anomalies.
  • Developed Spark jobs in PySpark to perform ETL from SQL Server to Hadoop, and worked on Spark Streaming using Kafka to submit jobs and run them in a live manner.
  • Designed and developed Flink pipelines to consume streaming data from Kafka, applying business logic to massage, transform, and serialize raw data.
  • Extensively worked with Text, Avro, Parquet, CSV, and JSON file formats and developed a common Spark data serialization module for converting complex objects into sequences of bits using these formats.
  • Responsible for operations and support of the Big Data analytics platform and Power BI visualizations.
  • Designed, developed, and managed a dashboard control panel for customers and administrators using Tableau, PostgreSQL, and REST API calls.
  • Worked with the Ab Initio team in development/enhancement of the existing models by adding extensions according to the business needs.
  • Developed a CI/CD pipeline to automate builds and deployments to the Dev, QA, and production environments.
  • Supported production jobs and developed several automated processes to handle errors and notifications. Also tuned the performance of slow jobs through design improvements and configuration changes to PySpark jobs.
  • Created standard report Subscriptions and Data Driven Report Subscriptions.
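A hedged sketch of producing messages with the Kafka producer API mentioned above, using the kafka-python client; the broker address, topic name, and record layout (broker:9092, network-events, device_id) are assumptions, and the actual cluster sizing and tuning for the ~1 million messages/sec workload are not shown.

    # Kafka producer sketch (kafka-python). Broker, topic, and record layout are assumptions.
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=["broker:9092"],
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
        linger_ms=5,   # small batching window helps sustain high throughput
        acks=1,
    )

    # Keyed sends so that records for the same device land in the same partition.
    record = {"device_id": "d-123", "status": "UP"}
    producer.send("network-events", key=record["device_id"], value=record)
    producer.flush()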
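And a minimal sketch of a Spark streaming job reading from Kafka, as referenced in the Spark Streaming bullets above; it uses Structured Streaming, the broker, topic, schema, and S3 paths are illustrative assumptions, and the spark-sql-kafka connector package must be on the classpath.

    # Structured Streaming from Kafka sketch. Broker, topic, schema, and paths are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-stream-sketch").getOrCreate()

    schema = StructType([
        StructField("device_id", StringType()),
        StructField("metric", DoubleType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker:9092")
           .option("subscribe", "network-events")
           .load())

    # Kafka delivers the value as binary; parse the JSON payload into columns.
    parsed = (raw
              .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    query = (parsed.writeStream
             .format("parquet")
             .option("path", "s3a://example-bucket/events/")              # placeholder path
             .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
             .start())
    query.awaitTermination()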

Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Flink, Kafka, Sqoop, SQL Server 2012, T-SQL, CI/CD, Git, XML, R.

Confidential, Plano, TX

Data Engineer

Responsibilities:

  • Worked on Spark using Python and Spark SQL for faster testing and processing of data.
  • Applied MLlib to build statistical models to classify and predict. Involved in the implementation of new statistical algorithms and operators on Hadoop and SQL platforms, utilizing optimization techniques, linear regression, K-means clustering, Naive Bayes, and other approaches (a hedged Spark ML sketch follows this list).
  • Developed a Spark Streaming pipeline to batch real-time data, detect anomalies by applying business logic, and write the anomalies to an HBase table.
  • Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
  • Developed a Spark batch job to automate the creation and metadata updates of external Hive tables created on top of datasets residing in HDFS (see the external-table sketch after this list).
  • Used HiveQL to analyze partitioned and bucketed data; executed Hive queries on Parquet tables to perform data analysis meeting the business specification logic.
  • Led the integration of Kafka messaging service functionality such as distribution, partitioning, and the replicated commit log for messaging systems, maintaining feeds for near-live stream processing.
  • Used Apache Kafka to aggregate web log data from multiple servers and make it available to downstream systems for data analysis and engineering roles.
  • Used the Spark API over Hortonworks Hadoop YARN to perform analytics on data in Hive on AWS. Experience with Hortonworks tools like Tez and Ambari.
  • Extracted files from MongoDB through Sqoop, placed them in HDFS, and processed them.
  • Hands-on experience with Amazon EC2, Redshift, Amazon S3 for the computing and storage of data.
  • Worked on ER modeling, dimensional modeling (star schema, snowflake schema), data warehousing, and OLAP tools.
  • Developed a common Flink module for serializing and deserializing Avro data by applying a schema.
  • Indexed processed data and created dashboards and alerts in Splunk to be utilized and actioned by support teams.
  • Implemented a layered architecture for Hadoop to modularize the design. Developed framework scripts to enable quick development. Designed reusable shell scripts for Hive, Sqoop, Flink, and Pig jobs. Standardized error handling, logging, and metadata management processes.
  • Designed a batch audit process in batch/shell scripts to monitor each ETL job, with status reporting that includes table name, start and finish time, number of rows loaded, status, etc.
  • Designed and implemented data acquisition and ingestion, and managed Hadoop infrastructure and other analytics tools (Splunk, Tableau).
  • Working knowledge of build automation and CI/CD pipelines.
  • Developed Python scripts to automate the data ingestion pipeline for multiple data sources and deployed Apache NiFi in AWS.
  • Designed and developed Tableau visualizations, including dashboards built with calculations, parameters, calculated fields, groups, sets, and hierarchies.
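A hedged Spark ML sketch of the classification work referenced above; the input table, feature columns, and label column are hypothetical, and the K-means and Naive Bayes work mentioned in the bullet is not shown here.

    # Spark ML classification sketch. Table, feature, and label names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = (SparkSession.builder
             .appName("ml-classification-sketch")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.table("features_table")   # placeholder Hive table with f1, f2, f3, label

    assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = Pipeline(stages=[assembler, lr]).fit(train)
    model.transform(test).select("label", "prediction", "probability").show(5)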
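The external-table bullet above can be illustrated with a short Spark batch sketch; the database name, table name, columns, and HDFS location are placeholders rather than the actual project objects.

    # Sketch: create/refresh an external Hive table over datasets in HDFS.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("external-table-sketch")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events_ext (
            device_id STRING,
            metric    DOUBLE
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
        LOCATION 'hdfs:///data/events'
    """)

    # Register any partitions that landed in HDFS since the last run.
    spark.sql("MSCK REPAIR TABLE analytics.events_ext")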

Environment: Hadoop, MapReduce, Spark, Spark MLlib, Tableau, SQL, Excel, Pig, Hive, AWS, PostgreSQL, Python, PySpark, Flink, Netezza, Kafka, Hortonworks, Tez, Ambari, SQL Server 2012, T-SQL, CI/CD, Git, XML.

Confidential, Plano, TX

Data Engineer

Responsibilities:

  • Worked on Hive UDFs; due to security privilege restrictions, the task had to be stopped partway through.
  • Worked on Spark SQL to fetch the NOT NULL data from two different tables and load it into a lookup table (a hedged sketch follows this list).
  • Populated HDFS and PostgreSQL with huge amounts of data using Apache Kafka.
  • Handled a large number of tables and millions of rows on a daily basis.
  • Experience in creating accumulators and broadcast variables in Spark (see the sketch after this list).
  • Worked on submitting Spark jobs that report metrics on the data, used for data quality checking.
  • Developed Sqoop and Kafka jobs to load data from RDBMS and external systems into HDFS and Hive.
  • Designed and implemented a test environment on AWS.
  • Designed and developed a REST API (Commerce API) that provides functionality to connect to PostgreSQL through Java services.
  • Responsible for designing and configuring network subnets, route tables, association of network ACLs to subnets, and OpenVPN.
  • Responsible for account management, IAM management, and cost management.
  • Designed AWS CloudFormation templates to create VPCs, subnets, and NAT to ensure successful deployment of web application and database templates.
  • Created S3 buckets and managed bucket policies; utilized S3 and Glacier for storage and backup on AWS.
  • Experience managing IAM users: creating new users, granting limited access as needed, and assigning roles and policies to specific users.
  • Acted as technical liaison between the customer and the team on all AWS technical aspects.
  • Involved in ramping up the team by coaching team members.
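A hedged Spark SQL sketch of the lookup-table load described above: rows with non-null keys are selected from two tables and written into a lookup table. The table and column names are illustrative only.

    # Spark SQL sketch: load NOT NULL data from two tables into a lookup table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("lookup-load-sketch")
             .enableHiveSupport()
             .getOrCreate())

    lookup = spark.sql("""
        SELECT a.customer_id, a.account_no, b.region
        FROM   table_a a
        JOIN   table_b b ON a.customer_id = b.customer_id
        WHERE  a.account_no IS NOT NULL
          AND  b.region     IS NOT NULL
    """)

    lookup.write.mode("overwrite").saveAsTable("lookup_customer_region")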
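And a minimal PySpark sketch of accumulators and broadcast variables as noted above; the error-code set and record layout are assumptions for illustration.

    # Accumulator and broadcast variable sketch (illustrative data).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("acc-broadcast-sketch").getOrCreate()
    sc = spark.sparkContext

    bad_records = sc.accumulator(0)                    # counts rejected rows on the driver
    error_codes = sc.broadcast({"E01", "E02", "E99"})  # small lookup shipped once per executor

    def keep(record):
        # Drop records whose code is in the broadcast error set, counting the rejects.
        if record["code"] in error_codes.value:
            bad_records.add(1)
            return False
        return True

    rdd = sc.parallelize([{"code": "OK"}, {"code": "E01"}, {"code": "OK"}])
    clean_count = rdd.filter(keep).count()             # action triggers the accumulator updates
    print(clean_count, "clean records,", bad_records.value, "rejected")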

Environment: Apache Spark, R, Kafka, AWS, Hive, Netezza, Informatica, Talend, AWS Redshift, AWS S3, Apache NiFi, Accumulo, Control-M.
