
Sr Data Engineer Resume


Los Angeles, CA

PROFESSIONAL SUMMARY:

  • 8+ years of experience in designing and developing end-to-end big data solutions using Spark, Hadoop, HBase, Hive, Sqoop, and AWS services.
  • Extensive experience in analyzing data with the Hadoop ecosystem, including PySpark, Sqoop, Flume, Kafka, HDFS, Hive, Pig, Impala, Oozie, ZooKeeper, OLR, NiFi, Spark SQL, and Spark Streaming.
  • Solid experience with Azure Data Lake Analytics, Azure Databricks, Azure Data Lake Storage, Azure Data Factory, Azure SQL databases, and Azure SQL Data Warehouse for providing analytics and reports that improve marketing strategies.
  • Strong experience using HDFS, MapReduce, Hive, Pig, Spark, Sqoop, Oozie, and HBase.
  • Implemented and deployed libraries through an enterprise Artifactory, managing versioned Anaconda and Miniconda environments for pandas and NumPy usage.
  • Proficient in using Jupyter Notebook and JupyterLab along with PySpark and other Python libraries such as pandas and Matplotlib for extensive data munging and large-scale data quality work and analysis.
  • Extensive experience in text analytics, developing statistical, machine learning, and data mining solutions to various business problems, and generating data visualizations using R, Python, and Tableau.
  • Experience configuring AWS EC2 instances and EMR clusters with S3 buckets, Auto Scaling groups, and CloudWatch.
  • Experience in data warehousing using ETL tools such as Informatica PowerCenter and PowerMart 9.x/8.x with databases such as DB2, Oracle, MS SQL Server, and Teradata.
  • Experienced working with various Hadoop distributions (Cloudera, Hortonworks, MapR, Amazon EMR) to fully implement and leverage new Hadoop features.
  • Extensively used PySpark to generate datasets in Palantir Foundry.
  • Experience in developing Spark applications using the Spark RDD, Spark SQL, and DataFrame APIs.
  • Experience using Docker and Amazon Web Services (AWS) infrastructure with automation and configuration management tools such as Ansible.
  • Strong knowledge of creating Extract, Transform, and Load (ETL) packages in SQL Server Integration Services for data migration between various databases.
  • Proficient in SQL databases (MySQL, PostgreSQL, Oracle) and NoSQL databases (MongoDB, Cassandra, HBase).
  • Hands-on experience with Hadoop architecture and its components, such as JobTracker, TaskTracker, NameNode, DataNode, and the HDFS framework.
  • Deep knowledge of troubleshooting and tuning Spark applications and Hive scripts to achieve optimal performance.
  • Experience with web services (SOAP, REST); effective in executing multiple tasks and assignments ahead of schedule.
  • Well versed in writing test cases using the nose, unittest, and Robot Framework test frameworks.
  • Excellent experience with Python development under Linux/Unix (Debian, Ubuntu, SUSE Linux, Red Hat Linux, Fedora) and Windows.
  • Expertise in writing Hadoop jobs for analyzing data using HiveQL (queries), Pig Latin (data flow language), and custom MapReduce programs in Java.
  • Experienced in building data pipelines using Kafka and Akka for handling terabytes of data.
  • Good hands-on experience creating RDDs and DataFrames for the required input data and performing data transformations using Spark with Scala.
  • Capable of processing large sets of structured, semi-structured, and unstructured data and supporting systems and application architecture.
  • Experience in leading multiple efforts to build Hadoop platforms, maximizing business value by combining data science with big data.
  • Experienced in developing object-relational mappings using Hibernate, JDBC, Spring JDBC, and Spring Data with RDBMS (Oracle, DB2, and MySQL) and NoSQL (MongoDB, Cassandra) databases.
  • Involved in the entire project life cycle, including design, development, deployment, testing, implementation, and support.
  • A self-starter with a positive attitude, willingness to learn new concepts and acceptance of challenges.
  • Good experience with network communication protocols and high-scalability architecture.

TECHNICAL SKILLS:

Big Data Frameworks: Hadoop, Spark, PySpark, Scala, Hive, AWS, HBase, Flume, Sqoop, Kafka

Databases: Oracle, MS SQL Server, MongoDB, Cassandra, Teradata, HBase, DynamoDB, PostgreSQL, DB2, MySQL, Redshift

Languages/Technologies: SQL, Python, Java, C#, C, Hadoop, Spark, Kafka, EMR, Hive, Pig, HBase, AWS, Sqoop, Hue, Oozie, Maven, GitHub, Jenkins, Glue

OS: Windows (NT, XP, 2000, 7), Linux, Unix

Web: XML, HTML, JavaScript, REST APIs, web services, AJAX, CSS

Tools: Eclipse, IntelliJ, MS Visual Studio, Luigi, Hadoop cluster, Erwin, Jira, Databricks

Cloud Technologies: AWS, GCP, Palantir Foundry, Azure, Snowflake

Version Control: Git, Bitbucket

DB Languages: SQL (MySQL, PostgreSQL, Oracle) and PL/SQL

Reporting Tools: Tableau, QuickSight, Periscope

PROFESSIONAL EXPERIENCE:

Confidential, Los Angeles, CA

Sr Data Engineer

Responsibilities:

  • Design, create, test, and deploy Spark ETL pipelines hosted on EMR clusters to automate report and dashboard delivery for stakeholders; generated reports for the organization's flagship product that increased customer base and retention.
  • Create datasets used by data science teams to improve prediction accuracy; developed a framework to improve data cataloging and dataset indexing; maintain and monitor ETL pipelines processing petabytes of data on the AWS suite.
  • Created Spark applications to perform data cleansing, validation, and transformation according to requirements.
  • Developed customized Sqoop scripts for loading data from databases such as Oracle, MS SQL, and Teradata.
  • Developed a PySpark count-check utility for Hive and RDBMS sources for data validation (see the count-check sketch after this list).
  • Extensive experience using Jupyter Notebook with Python, PySpark, pandas, Matplotlib, and NumPy.
  • Compiled daily data quality metrics for 500+ tables using web scraping with pandas and PySpark.
  • Designed and implemented multiple ETL solutions with various data sources using extensive SQL scripting, ETL tools, Python, shell scripting, and scheduling tools; performed data profiling and data wrangling of XML and web feeds and file handling using Python, Unix, and SQL.
  • Worked with the ETL team to document transformation rules for data migration from OLTP to the warehouse environment for reporting purposes.
  • Designed various Jenkins jobs to continuously integrate the process and executed CI/CD pipelines using Jenkins.
  • Developed an end-to-end (ingestion to monitoring) NiFi data flow to process high volumes of streaming XML and JSON event data from IoT sensors positioned on locomotive engines, base stations, and wayside detectors.
  • Extensively used Apache Spark features such as RDD operations (mapping, merging, combining, aggregating, and vectorizing data) as well as DataFrames and Datasets for transformation, data enrichment, data storage operations, descriptive statistics, and aggregation.
  • Worked on big data with AWS cloud services, i.e., EC2, S3, EMR, and DynamoDB.
  • Extensively developed PySpark scripts for data ingestion and transformation, building data pipelines for data scientists, data analysts, and business analysts.
  • Worked on integration testing and big data integration and analytics based on Hadoop, Solr, Spark, Kafka, Storm, and webMethods.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python, as well as in NoSQL databases such as HBase and Cassandra.
  • Developed automated regression scripts in Python for validating ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks.
  • Involved in migrating other databases to Snowflake.
  • Applied effective data management techniques and operating procedures to understand the needs of various real-time use cases.
  • Unit tested data between Redshift and Snowflake.
  • Evaluated data file submissions and developed/maintained SSIS packages for ETL processes.
  • Implemented advanced data transformation logic, such as forward filling using PySpark, for special multi-layout use cases (a minimal forward-fill sketch appears after this list).
  • Ran and monitored Control-M jobs for scheduling and queuing jobs in data pipelines.
  • Created user guides and presentations for effective and efficient data analytics for analyst use cases.
  • Research, analyze, and derive insights from the NoSQL data warehouse using Databricks; optimize and improve pipeline performance following best practices and handling data skew.
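
The following is a minimal, hypothetical sketch of the count-check idea referenced above, comparing row counts between a Hive table and a JDBC-accessible RDBMS table; the table names, JDBC URL, and credentials are placeholders, not the production utility.

```python
from pyspark.sql import SparkSession

def compare_counts(spark, hive_table, jdbc_url, jdbc_table, jdbc_props):
    """Compare row counts between a Hive table and an RDBMS table (illustrative only)."""
    hive_count = spark.table(hive_table).count()
    rdbms_count = spark.read.jdbc(url=jdbc_url, table=jdbc_table, properties=jdbc_props).count()
    status = "MATCH" if hive_count == rdbms_count else "MISMATCH"
    print(f"{hive_table}: hive={hive_count}, rdbms={rdbms_count} -> {status}")
    return hive_count == rdbms_count

if __name__ == "__main__":
    spark = SparkSession.builder.appName("count-check").enableHiveSupport().getOrCreate()
    # Hypothetical connection details; replace with real values.
    compare_counts(
        spark,
        hive_table="sales_db.orders",
        jdbc_url="jdbc:oracle:thin:@//db-host:1521/ORCL",
        jdbc_table="SALES.ORDERS",
        jdbc_props={"user": "etl_user", "password": "****", "driver": "oracle.jdbc.OracleDriver"},
    )
```

And a sketch of the forward-fill technique mentioned above, using PySpark window functions; the device/reading column names and sample data are assumptions for illustration.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("ffill-demo").getOrCreate()

# Hypothetical sample data: per-device readings with gaps (None).
df = spark.createDataFrame(
    [("d1", 1, 10.0), ("d1", 2, None), ("d1", 3, None), ("d1", 4, 12.5),
     ("d2", 1, None), ("d2", 2, 7.0)],
    ["device_id", "event_ts", "reading"],
)

# Forward-fill: carry the last non-null `reading` forward per device, ordered by event time.
w = (Window.partitionBy("device_id")
     .orderBy("event_ts")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df_filled = df.withColumn("reading_ffill", F.last("reading", ignorenulls=True).over(w))
df_filled.show()
```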

Environment: Spark, PySpark, Spark SQL, Python, GCP, Sqoop, Java, Hadoop, CI/CD, AWS, Luigi, Redshift, Kafka, Snowflake, Scrum, Databricks, shell scripting, Data Migration Services, MySQL, Athena, MapReduce, Hive, Agile, Cassandra, Unix, EMR, NoSQL, HDFS, Jenkins.

Confidential, Los Angeles, CA

Sr. Data Engineer

Responsibilities:

  • Involved in building the PySpark framework for generating DataFrames in Palantir Foundry.
  • Responsible for Contour graphs using Spark DataFrames within Palantir Foundry.
  • Involved in scheduling Palantir Foundry jobs to run on trigger events or on a schedule.
  • Set up transfers of data feeds from source systems into a location accessible to Foundry.
  • Optimized datasets by applying partitioning and bucketing in Hive and performance-tuning Hive queries.
  • Ingested new data sources using Foundry's data ingestion UIs.
  • Debugged issues related to delayed or missing data feeds.
  • Wrote transformations in PySpark and Spark SQL to derive new datasets (see the sketch after this list).
  • Monitored build progress and debugged problems.
  • Used Spark for distributed computation.
  • Debugged and recommended optimizations in creating the master dataset used for pattern extraction in the Rules Generation Algorithm (RGA) PySpark model.
  • Extracted all Transmit reporting fields from very large JSON files stored in MongoDB by applying various filters.
  • Developed a PySpark application for creating Payfone reporting tables with different maskings in both Hive and MySQL and made them available to newly built fetch APIs.
  • Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed the data in Azure Databricks.
  • Managed resources and scheduling across the cluster utilizing Azure Kubernetes Service.
  • Used Databricks to integrate easily with the whole Microsoft stack.
  • Designed and automated custom-built input connectors utilizing Spark, Sqoop, and Oozie to ingest and analyze data from RDBMS sources into Azure Data Lake.
  • Used Azure Data Factory with the SQL API and MongoDB API to integrate data from MongoDB, MS SQL, and cloud sources (Blob Storage, Azure SQL DB).
  • Developed JSON scripts for deploying pipelines in Azure Data Factory (ADF) that process data using the SQL activity.
  • Developed shell scripts for loading ZIP and CSV files from Azure Blob Storage to the Data Lake.
  • Experience implementing efficient storage formats such as Avro, Parquet, and ORC.
  • Used Foundry's application development framework to design applications that address operational questions.
  • Involved in performance-tuning Spark SQL and analyzing Spark logs and DAGs in Palantir Foundry.
  • Maintained rapid development and iteration cycles with SMEs.
  • Tested and troubleshot application issues.
  • Investigated data questions surrounding the application.
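
As referenced above, here is a minimal, hypothetical sketch of the kind of PySpark transformation used to derive a masked reporting dataset; it uses plain Spark APIs rather than the Foundry transform pipeline, and the table and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("masked-reporting").enableHiveSupport().getOrCreate()

# Hypothetical source table and column names.
events = spark.table("raw_db.subscriber_events")

# Derive a reporting dataset: mask the phone number down to its last 4 digits
# and aggregate event counts per day.
reporting = (
    events
    .withColumn("phone_masked", F.concat(F.lit("******"), F.col("phone_number").substr(-4, 4)))
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "phone_masked")
    .agg(F.count("*").alias("event_count"))
)

reporting.write.mode("overwrite").saveAsTable("reporting_db.subscriber_events_daily")
```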

Environment: Spark, PySpark, Spark SQL, Python, Sqoop, PyCharm, AWS, Luigi, MongoDB, Databricks, MySQL, shell scripting, Athena, MapReduce, Hive, Jenkins, Data Migration Services, Agile, EMR, NoSQL, Kafka, HDFS, Azure (HDInsight, Databricks, Data Lake, Blob Storage, Data Factory, SQL DB, SQL DWH, AKS).

Confidential, Los Angeles, CA

Data Engineer

Responsibilities:

  • Interacted with the experienced team to obtain the required data files for FFCU members.
  • Brought data into the data lake environment per consumer requirements.
  • Extensively used Apache Spark features such as RDD operations (mapping, merging, combining, aggregating, and vectorizing data) as well as DataFrames and Datasets for transformation, data enrichment, data storage operations, descriptive statistics, and aggregation.
  • Worked on designing Hive tables, both managed and external, for use with data pipelines and interactive querying.
  • Used the Sqoop API to perform data ingestion into the raw zone.
  • Developed the data format files required by the model to perform analytics, using Spark SQL and Hive Query Language.
  • Developed Spark programs using Scala APIs to compare the performance of Spark with Hive and SQL.
  • Implemented Spark with Scala and Spark SQL for faster testing and processing of data.
  • Analyzed existing SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Responsible for writing test plans and executing tests via automation and manually.
  • Developed database layer including tables and stored procedures in MySQL.
  • Utilized Spark SQL and the DataFrames API in Spark for writing custom transformations and data aggregations.
  • Created Hive tables using partitions for optimal usage (see the partitioning sketch after this list).
  • Used Spark DataFrames to perform analytics on Hive data.
  • Developed and delivered under the Kanban methodology.
  • Integrated big data into traditional ETL for the extraction, transformation, and loading of structured and unstructured data for analytics/solutions, operations, client services, and product management.
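
A minimal sketch of the partitioned Hive table pattern referenced above, issued through Spark SQL; the database, table, and column names are hypothetical placeholders rather than the actual project objects.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hive-partitions").enableHiveSupport().getOrCreate()

# Hypothetical staging source; column names are illustrative.
stg = (spark.table("staging_db.member_transactions_stg")
            .withColumn("load_date", F.to_date("txn_ts")))

# Save as a Hive table partitioned by load_date so queries that filter on the
# partition column prune directories instead of scanning the whole table.
(stg.write.mode("overwrite")
    .partitionBy("load_date")
    .format("parquet")
    .saveAsTable("analytics_db.member_transactions"))

# Partition-pruned query through Spark SQL.
spark.sql("""
    SELECT txn_type, SUM(txn_amount) AS total_amount
    FROM analytics_db.member_transactions
    WHERE load_date = DATE '2020-01-15'
    GROUP BY txn_type
""").show()
```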

Environment: Lambda, Spark, Athena, Hive, Azure, HDFS, HBase, Sqoop, Oozie, MapReduce, Redshift.

Confidential

Hadoop Developer

Responsibilities:

  • Optimized Hive queries using partitioning and bucketing techniques to control data distribution.
  • Wrote Sqoop scripts to ingest data from different RDBMS data sources.
  • Used Flume and Sqoop extensively to gather and move data files from application servers to the Hadoop Distributed File System (HDFS).
  • Strong experience writing Perl scripts covering data feed handling, implementing business logic, and communicating with web services.
  • Developed Pig Latin scripts and Pig command-line transformations for data joins, custom processing of MapReduce outputs, and loading tables from Hadoop to various clusters.
  • Worked on migrating HiveQL to Impala to minimize query response time.
  • Designed and developed MySQL stored procedures and shell scripts for data import/export and conversions.
  • Used Git and Jira for code submissions.
  • Implemented Hive partitions, joins, and bucketing.
  • Implemented a near-real-time data pipeline using a framework based on Kafka and Spark (see the sketch after this list).
  • Wrote Oozie workflows for scheduling jobs and running Pig scripts and HiveQL.
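
An illustrative sketch of the Kafka-plus-Spark near-real-time pipeline mentioned above, written here with PySpark Structured Streaming; the broker addresses, topic, and HDFS paths are assumptions, and it requires Spark's Kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read a hypothetical Kafka topic as a stream; broker and topic names are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
    .option("subscribe", "app.events")
    .option("startingOffsets", "latest")
    .load()
    .select(F.col("value").cast("string").alias("event_json"),
            F.col("timestamp").alias("event_ts"))
)

# Land the raw events on HDFS as Parquet, with checkpointing for fault tolerance.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/app_events")
    .option("checkpointLocation", "hdfs:///checkpoints/app_events")
    .trigger(processingTime="1 minute")
    .start()
)

query.awaitTermination()
```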

Environment: Hive, Sqoop, Kafka, Pig, Flume, Java, HCatalog, Hue.

Confidential

BI and ETL Developer

Responsibilities:

  • Designed, developed, and architected the data warehouse and data marts to support reporting and business intelligence solutions.
  • Designed and developed complex SSIS packages using SQL Server and Teradata.
  • Worked closely with the Accounting, Actuarial, client, and internal DataMart teams.
  • Worked closely with the client, the GDS team, the SAS Global Hosting team, and the SAS support team.
  • Customized SAS ETL scripts in SAS Studio, customized workflows, added a maker-checker role in RGF, and customized the email notification template on the IRM server.
  • Scheduled ETL flows in the SKED scheduler, designed and implemented UAM in SMC, prepared the data backup and retention document, and was involved in UAT and PROD shakedown activities.
  • Completed the Risk Solution implementation for IFRS 17 on Viya 3.5.
  • Responsible for requirement gathering, POCs, and designing, developing, and delivering Microsoft Power BI reports.
  • Developed line and stacked column charts; used drill-down/up and drill-through; summarized data to show machine status and total worked duration as percentages.
  • Refreshed Power BI datasets using PowerShell scripts (API), DAX, and the Power BI scheduler in the cloud (SaaS); shared reports with users via the Power BI app and SharePoint (a hypothetical refresh sketch appears after this list).
  • Developed the flow for email alerts in Microsoft Flow.
  • Validated columnar data requirements, defined business logic and the DataMart design, developed the high-level solution architecture, designed the FSD and TSD, built the takaful DataMart using Informatica PowerCenter, developed test scenarios, and executed test cases.
  • Involved in test management activities, worked in Jira, and handled defect debrief calls with the client.
  • Involved in developing ETL solutions using Informatica.
  • Primarily involved in developing ETL workflows and stored procedures for building the data warehouse.
  • Involved in requirement gathering and design sessions.
  • Handled data modeling, profiling and validation.
  • Production support and maintenance.
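
The dataset refreshes above were driven from PowerShell against the Power BI REST API; as a hypothetical Python equivalent of that call, the sketch below queues a refresh for a dataset. The workspace/dataset IDs and access token are placeholders, and token acquisition from Azure AD is out of scope here.

```python
import requests

# Hypothetical IDs and token; in practice the token comes from Azure AD
# (e.g. a service principal) before calling the Power BI REST API.
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"
DATASET_ID = "11111111-1111-1111-1111-111111111111"
ACCESS_TOKEN = "<aad-access-token>"

def trigger_refresh(workspace_id: str, dataset_id: str, token: str) -> None:
    """Queue a Power BI dataset refresh via the REST API."""
    url = (f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}"
           f"/datasets/{dataset_id}/refreshes")
    resp = requests.post(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()  # a 202 Accepted response means the refresh was queued
    print(f"Refresh queued for dataset {dataset_id}")

if __name__ == "__main__":
    trigger_refresh(WORKSPACE_ID, DATASET_ID, ACCESS_TOKEN)
```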

Environment: Microsoft SSIS 2008, IBM InfoSphere Data Architect, Teradata, Teradata SQL Assistant 12.0, SQL Server 2008, Informatica PowerCenter 7.1.2/7.1.3 (Repository Manager, Designer & Server Manager), Cognos 7.x/8.x suite, Cognos Java API.
