Sr. Data Engineer/Big Data Engineer Resume

New York City, NY

SUMMARY

  • Around 7 years of professional IT experience working with various legacy database systems as well as Big Data technologies.
  • Experience working with data modeling tools like Erwin and ER/Studio.
  • Extensive experience in automating ETL procedures with UNIX shell scripting.
  • Experience working with Azure Blob Storage, Azure Data Lake, Azure Data Factory, Azure SQL, Azure SQL Data Warehouse, Azure Analytics, PolyBase, Azure HDInsight, and Azure Databricks.
  • Strong background in designing and implementing business intelligence solutions using the ETL tool Informatica PowerCenter 9.x/8.x/7.x/6.x to build staging areas, operational data stores (ODS), enterprise data warehouses (EDW), data marts, and decision support systems.
  • Extensively used Teradata utilities including BTEQ, FastExport, FastLoad, MultiLoad, and TPump to export and load data to/from various source systems, including flat files.
  • Expertise in using major Hadoop ecosystem components such as HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
  • Expertise in using various Hadoop infrastructure components such as MapReduce, Pig, Hive, Zookeeper, HBase, Sqoop, Oozie, Flume, Drill, and Spark for data storage and analysis.
  • Experience in developing data pipelines using AWS services including EC2, S3, Redshift, Glue, Lambda functions, Step functions, CloudWatch, SNS, DynamoDB, SQS
  • Expertise in writing BTEQ scripts, stored procedures, and macros in Teradata.
  • Experience in developing custom UDFs for Pig and Hive to incorporate methods and functionality of Python/Java into Pig Latin and HQL (HiveQL), and used UDFs from the Piggybank UDF repository.
  • Worked with Informatica PowerCenter 10.1/9.5/9.1/8.6 and Informatica Data Quality (IDQ) 9.5/9.1 as ETL tools for extracting, transforming, and loading data from various source data inputs to various targets.
  • Experience in designing star schema, Snowflake schema for Data Warehouse, ODS architecture.
  • Implemented various algorithms for analytics using Cassandra with Spark and Scala.
  • Expertise in the Amazon Web Services (AWS) Cloud Platform, including services such as EC2, S3, VPC, ELB, IAM, DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, Auto Scaling, Security Groups, EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift, CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
  • Development-level experience in Microsoft Azure, providing data movement and scheduling functionality to cloud-based technologies such as Azure Blob Storage and Azure SQL Database.
  • Experience in converting SSIS packages and Hadoop HiveQL to Informatica.
  • Experience in working with NoSQL databases like HBase and Cassandra.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java and Scala for data cleaning and preprocessing.
  • Developed extraction mappings to load data from Source systems to ODS to Data Warehouse.
  • Involved in conceptual, logical, and physical data modeling and used star schema in designing the data warehouse.
  • Experience in setting up and maintaining Auto-Scaling AWS stacks.
  • Used the GitHub version control tool to push and pull updated code from the repository.
  • Expertise in designing complex mappings, performance tuning, and implementing Slowly Changing Dimension tables and Fact tables.
  • Deployed Spark Cluster and other services in AWS using console.
  • Developed simple and complex MapReduce programs in Java for Data Analysis.
  • Deployed Data Lake cluster with Hortonworks Ambari on AWS using EC2 and S3.
  • Good communication skills, a strong work ethic, and the ability to work efficiently in a team, along with good leadership skills.
  • Developed Spark applications in Python (PySpark) on a distributed environment to load large numbers of CSV files with differing schemas into Hive ORC tables, as sketched below.
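
A minimal PySpark sketch of the CSV-to-Hive-ORC load described in the last bullet. The paths, database, and column handling are illustrative assumptions rather than details from an actual engagement.

    # Sketch only: load CSV folders with differing schemas and write them to a Hive ORC table.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (SparkSession.builder
             .appName("csv_to_hive_orc")
             .enableHiveSupport()          # needed so saveAsTable targets the Hive metastore
             .getOrCreate())

    # Read each feed separately because the schemas differ, then align on the shared columns.
    df_a = spark.read.option("header", True).csv("hdfs:///landing/feed_a/")   # hypothetical path
    df_b = spark.read.option("header", True).csv("hdfs:///landing/feed_b/")   # hypothetical path

    common_cols = [c for c in df_a.columns if c in df_b.columns]
    combined = df_a.select(common_cols).unionByName(df_b.select(common_cols))

    # Append the aligned data to an ORC-backed Hive table with a load-date column.
    (combined.withColumn("load_dt", F.current_date())
             .write.mode("append")
             .format("orc")
             .saveAsTable("staging.customer_feeds"))                          # hypothetical table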

TECHNICAL SKILLS

BigData/Hadoop Technologies: MapReduce, Spark SQL, Spark Streaming, Kafka, PySpark, AWS, Pig, Hive, HBase, Flume, YARN, Oozie, Zookeeper

Languages: C, C++, XML, R/R Studio, SAS Enterprise Guide, SAS, MATLAB, Mathematica, FORTRAN, DTD, Schemas, JSON, Ajax, Java, Scala, Python, JavaScript, Shell Scripting

NoSQL Databases: Cassandra, HBase, MongoDB, MariaDB

Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse, NetBeans.

Public Cloud: EC2, IAM, S3, Auto Scaling, CloudWatch, Route 53, EMR, Redshift

Development Methodologies: Agile/Scrum, UML, Design Patterns, Waterfall

Build Tools: Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, ANT, RTC, RSA, Control-M, Oozie, Hue, SOAP UI

Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos

Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 10g/11g/12c, DB2, Teradata, Netezza

Operating Systems: All versions of Windows, UNIX, Linux, macOS, Sun Solaris

PROFESSIONAL EXPERIENCE

Confidential, New York City, NY

Sr. Data Engineer/Big Data Engineer

Responsibilities:

  • Built and maintained data processing infrastructure that supports complex analysis across data science and experimentation teams.
  • Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join dataset scripts using Hive join operations.
  • Developed Apache Spark applications for processing data from various streaming sources.
  • Design, develop, and maintain Azure data pipelines for real-time/batch analysis, reporting, optimization, data collection, and related functions
  • Created technical ETL mapping documents and test cases for each mapping in preparation for upcoming developments to sustain the software development life cycle (SDLC).
  • Implemented large scale technical solutions using Object Oriented Design and Programming concepts using Python
  • Performed SSIS tuning for efficient ETL and utilized best-practice SSIS design patterns to improve ETL performance and scalability.
  • Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, T-SQL, Spark SQL, and U-SQL (Azure Data Lake Analytics).
  • Data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing of the data in Azure Databricks.
  • Designed workflows and coordinators, managed in Oozie and Zookeeper to automate and parallelize Hive, Sqoop and Pig jobs in Cloudera Hadoop using XML.
  • Responsible for managing data coming from different sources through Kafka; a minimal sketch of this Kafka-to-Hive flow follows this list.
  • Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
  • Responsibilities include gathering business requirements, developing strategy for data cleansing and data migration, writing functional and technical specifications, creating source to target mapping, designing data profiling and data validation jobs in Informatica, and creating ETL jobs in Informatica.
  • Implemented Copy Activity and custom Azure Data Factory pipeline activities.
  • Primarily involved in data migration using SQL, Azure SQL, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
  • Architected, built, and launched efficient and reliable data pipelines with automated ETL and minimal human supervision.
  • Wrote HiveQL as per requirements, processed data in the Spark engine, and stored the results in Hive tables.
  • Imported existing datasets from Oracle into the Hadoop system using Sqoop.
  • Created Sqoop jobs with incremental load to populate Hive External tables.
  • Wrote Spark Core programs to process and cleanse data, then loaded it into Hive or HBase for further processing.
  • Ability to work effectively in cross-functional team environments with excellent communication and interpersonal skills; involved in converting Hive/SQL queries into Spark transformations using Spark DataFrames and Scala.
  • Good working experience with Spark (Spark Streaming, Spark SQL), Scala, and Kafka; worked on reading multiple data formats on HDFS using Scala.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
  • Performed ETL testing activities such as running jobs, extracting data from source databases with the necessary queries, transforming it, and loading it into the data warehouse servers.
  • Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data; developed a Spark SQL application for the Big Data migration from Teradata to Hadoop, reducing memory utilization in Teradata analytics.
  • Unit Testing, Integration Testing and Performance Testing of Informatica, Talend, IRIS tool jobs and stored procedures.
  • Implemented data quality in the ETL tool Talend and have good knowledge of data warehousing.
  • Designed the data aggregations on Hive for ETL processing on Amazon EMR to process data as per business requirement
  • Optimization of Hive queries using best practices and right parameters and using technologies like Hadoop, YARN, Python, Pyspark.
  • Utilized Sqoop to import structured data from MySQL, SQL Server, PostgreSQL, and a semi-structured csv file dataset into HDFS data lake.
  • Developed solutions to leverage ETL tools and identify opportunities for process improvements using Informatica and Python.
  • Analyzed data, identified anomalies, and provided usable insights to customers.
  • Involved in review of functional and non-functional requirements.
  • Developed Spark SQL logic that mimics the Teradata ETL logic and points the output delta back to newly created Hive tables as well as the existing Teradata dimension, fact, and aggregate tables.
  • Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs for data cleaning and preprocessing.
  • Used Apache NiFi to copy data from local file system to HDP.
  • Responsible for performing extensive data validation using Hive.
  • Created Sqoop jobs and Hive scripts for data ingestion from relational databases to compare with historical data.
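
The bullets above mention managing Kafka data and landing processed results in Hive tables; the following is a hedged PySpark sketch of that flow, not the production code. The broker address, topic, event schema, and target table are hypothetical, and the spark-sql-kafka connector is assumed to be on the classpath.

    # Sketch: consume Kafka messages with Structured Streaming, cleanse, and append to Hive.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = (SparkSession.builder
             .appName("kafka_to_hive")
             .enableHiveSupport()
             .getOrCreate())

    event_schema = StructType([
        StructField("event_id", StringType()),
        StructField("account_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "broker1:9092")   # hypothetical broker
           .option("subscribe", "transactions")                 # hypothetical topic
           .load())

    # Kafka delivers bytes; cast the value to a string and parse the JSON payload.
    events = (raw.selectExpr("CAST(value AS STRING) AS json")
              .select(F.from_json("json", event_schema).alias("e"))
              .select("e.*")
              .filter(F.col("event_id").isNotNull()))           # basic cleansing rule

    def write_batch(batch_df, batch_id):
        # foreachBatch lets each micro-batch reuse the ordinary batch writer,
        # so the data lands in a Hive-managed ORC table.
        batch_df.write.mode("append").format("orc").saveAsTable("curated.transactions")

    query = (events.writeStream
             .foreachBatch(write_batch)
             .option("checkpointLocation", "hdfs:///checkpoints/kafka_to_hive")
             .start())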

Environment: Hadoop (HDFS, MapReduce), Scala, Databricks, Informatica, YARN, IAM, PostgreSQL, Spark, Impala, Hive, MongoDB, Pig, Zookeeper, DevOps, HBase, Teradata, Oozie, Hue, Sqoop, Flume, Oracle, NiFi, Git, AWS services (Lambda, EMR, Auto Scaling).

Confidential, Minneapolis, MN

Big Data Engineer

Responsibilities:

  • Worked with subject matter experts and project team to identify, define, collate, document, and communicate the data migration requirements.
  • Used the ETL tool Informatica to populate the database and transform data from the old database to the new Oracle database.
  • Experience in converting existing AWS infrastructure to a serverless architecture (AWS Lambda, Kinesis), deployed via Terraform and AWS CloudFormation templates.
  • Worked on Spark SQL, created DataFrames by loading data from Hive tables, created prep data, and stored it in AWS S3.
  • Worked closely with the ETL SSIS developers to explain the complex data transformation logic.
  • Developed Spark scripts by using Scala shell commands as per the requirement.
  • Developed Scala scripts using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, and wrote data back into the OLTP system through Sqoop.
  • Developed batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
  • Created a Tableau report on the Teradata solution to provide business with their day-to-day audit reporting needs
  • Extensively worked in database components like SQL, PL/SQL, Stored Procedures, Stored Functions, Packages and Triggers.
  • Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
  • Collaborated with ETL, and DBA teams to analyze and provide solutions to data issues and other challenges while implementing the OLAP model.
  • Collected data from an AWS S3 bucket in near real time using Spark Streaming, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which gets the data from Kafka in near real time and persists it to Redshift clusters.
  • Wrote UNIX shell scripts to automate jobs and scheduled cron jobs for job automation using crontab.
  • Extensively used Hive/HQL or Hive queries to query data in Hive Tables and loaded data into HBase tables.
  • Extensively worked with partitions, dynamic partitioning, and bucketed tables in Hive; designed both managed and external tables and worked on optimization of Hive queries (see the partitioning sketch after this list).
  • Have experience in installing, configuring, and administering Hadoop clusters for major Hadoop distributions like CDH4 and CDH5.
  • Created SAS ODS reports using SAS EG, SAS SQL, and OLAP Cubes.
  • Installed Ranger in all environments for Second Level of security in Kafka Broker.
  • Built performant, scalable ETL processes to load, cleanse and validate data
  • Created a parallel branch to load the same data to Teradata using Sqoop utilities
  • Collaborate with team members and stakeholders in design and development of data environment
  • Preparing associated documentation for specifications, requirements, and testing
  • Designed and implemented MapReduce-based large-scale parallel relation-learning system
  • Extracted data from databases like Oracle, SQL server and DB2 using Informatica to load it into a single repository for data analysis.
  • Provide troubleshooting and best practices methodology for development teams.
  • This includes process automation and new application onboarding.
  • Produce unit tests for Spark transformations and helper methods. Design data processing pipelines.
  • Configuring IBM Http Server, WebSphere Plugins and WebSphere Application Server Network Deployment (ND) for user work-load distribution.
  • Assisted the Deployment team in setting up the Hadoop cluster and services.
  • Good experience in generating statistics, extracts, and reports from Hadoop.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
  • Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
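
As a companion to the Hive partitioning bullet above, here is an illustrative-only PySpark sketch of the dynamic-partitioning pattern; the database, table, and partition column are made-up names, not details from the project.

    # Sketch: load a staging table into a dynamically partitioned, ORC-backed Hive table.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive_dynamic_partitions")
             .enableHiveSupport()
             .getOrCreate())

    # Non-strict mode lets Hive derive every partition value from the data itself.
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    staged = spark.table("staging.orders")        # hypothetical source table

    # partitionBy creates one Hive partition per distinct order_date value.
    (staged.write
           .mode("overwrite")
           .format("orc")
           .partitionBy("order_date")
           .saveAsTable("warehouse.orders_partitioned"))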

Environment: Apache Beam, Cloud Dataflow, Cloud Shell, Azure Cloud SQL, MySQL, Postgres, SQL Server, Teradata, Python, Scala, Spark, Informatica, Hive, Spark SQL.

Confidential, Dublin, OH

Data Engineer

Responsibilities:

  • Gathered data and business requirements from end users and management; designed and built data solutions to migrate existing source data in the data warehouse to the Atlas Data Lake (Big Data).
  • Performed all the Technical Data quality (TDQ) validations which include Header/Footer validation, Record count, Data Lineage, Data Profiling, Check sum, Empty file, Duplicates, Delimiter, Threshold, DC validations for all Data sources.
  • Designed, managed, and maintained Azure data pipelines for real-time/batch analysis, reporting, optimization, data collection, and associated tasks.
  • Designed DataStage ETL jobs for extracting data from heterogeneous source systems, transforming it, and finally loading it into the data warehouse.
  • Performed data validation, data profiling, data auditing and data cleansing activities to ensure high quality Cognos report deliveries.
  • Worked with senior management to plan, define, and clarify Tableau dashboard goals, objectives, and requirements.
  • Data ingestion to one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processing of the data in Azure Databricks.
  • Developed Python scripts to extract the data from the web server output files to load into HDFS.
  • Created effective Test Cases and performed Unit and Integration Testing to ensure the successful execution of data loading process
  • Involved in HBASE setup and storing data into HBASE, which will be used for further analysis.
  • Involved in writing Oracle stored procedures and functions for calling during the execution of Informatica mapping or as Pre-or Post-session execution.
  • Worked on Cloud Health tool to generate AWS reports and dashboards for cost analysis.
  • Wrote a Python script that automates launching the EMR cluster and configures the Hadoop applications.
  • Extensively worked with Avro and Parquet files and converted data between the formats; parsed semi-structured JSON data and converted it to Parquet using DataFrames in PySpark.
  • Developed workflows in Oozie and Airflow to automate the tasks of loading data into HDFS and pre-processing it with Pig and Hive.
  • Actively involved in Analysis phase of the business requirement and design of the Informatica mappings.
  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
  • Created several types of data visualizations using Python and Tableau. Extracted Mega Data from AWS using SQL Queries to create reports.
  • Developed and designed system to collect data from multiple portals using Kafka and then process it using spark.
  • Involved in loading data from rest endpoints to Kafka Producers and transferring the data to Kafka Brokers.
  • Developed a preprocessing job using Spark DataFrames to flatten JSON documents into a flat file (see the sketch after this list).
  • Troubleshoot RSA SSH keys in Linux for authorization purposes.
  • Inserted data from multiple CSV files into MySQL, SQL Server, and PostgreSQL using Spark.
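
A small sketch, under assumed file layouts, of the JSON-flattening step mentioned above: nested documents are flattened with Spark DataFrames and written out as a delimited flat file and as Parquet. The bucket, field names, and nesting are hypothetical.

    # Sketch: flatten nested JSON documents into flat rows with PySpark.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("flatten_json").getOrCreate()

    docs = spark.read.json("s3a://example-bucket/raw/events/")   # hypothetical location

    # Promote nested struct fields to top-level columns and explode the items array
    # so each element becomes its own row.
    flat = (docs.select(
                F.col("id"),
                F.col("customer.name").alias("customer_name"),
                F.col("customer.address.city").alias("city"),
                F.explode_outer("items").alias("item"))
            .select("id", "customer_name", "city",
                    F.col("item.sku").alias("sku"),
                    F.col("item.qty").alias("qty")))

    # Flat file for downstream consumers, Parquet for analytics.
    flat.write.mode("overwrite").option("header", True).csv("s3a://example-bucket/flat/events_csv/")
    flat.write.mode("overwrite").parquet("s3a://example-bucket/flat/events_parquet/")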

Environment: Spark, Redshift, Python, AWS, HDFS, Hive, Pig, Scala, Informatica, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Teradata, Git, Oozie, Cloudera, Oracle 11g/10g, PL/SQL, Unix.

Confidential

Data Engineer

Responsibilities:

  • Developed MapReduce jobs in Java for data cleaning and preprocessing.
  • Gathered all the Sales Analysis report prototypes from the business analysts belonging to different Business units
  • Worked with Master SSIS packages to execute a set of packages that load data from various sources onto the Data Warehouse on a timely basis.
  • Designed and implemented appropriate ETL mappings to extract and transform data from various sources to meet requirements with Informatica.
  • Involved in Data Extraction, Transformation and Loading (ETL) from source systems.
  • Setup and benchmarked Hadoop/HBase clusters for internal use
  • Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
  • Performed detailed data investigation and analysis of known data quality issues in related databases through SQL.
  • Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the requirement.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu); a format-conversion sketch follows this list.
  • S3 data lake management: responsible for maintaining and handling inbound and outbound data requests through the big data platform.
  • Involved in the configuration of WebLogic servers, DSs, JMS queues, and the deployment.
  • Involved in creating queues, MDBs, and workers to accommodate the messaging used to track the workflows.
  • Built PL/SQL (Procedures, Functions, Triggers, and Packages) to summarize the data to populate summary tables that will be used for generating reports with performance improvement.
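
Since the storage-format bullet above covers Avro and Parquet, here is a hedged PySpark sketch of converting one to the other. It assumes the spark-avro package is available on the cluster (for example via --packages org.apache.spark:spark-avro_2.12:3.3.0), and the paths are placeholders.

    # Sketch: convert Avro files to Parquet with PySpark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("avro_to_parquet").getOrCreate()

    # Requires the spark-avro package; the "avro" format is not bundled with core Spark.
    avro_df = spark.read.format("avro").load("hdfs:///data/raw/clicks_avro/")

    # Parquet keeps the same schema but stores it column-wise, which suits Hive/Impala scans.
    (avro_df.write
            .mode("overwrite")
            .parquet("hdfs:///data/curated/clicks_parquet/"))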

Environment: Hadoop, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Linux, XML, Java 6, Eclipse, Oracle 10g, PL/SQL, MongoDB
