Sr. Big Data Engineer Resume
Detroit, MI
SUMMARY
- 8+ years of IT experience with a solid background in Big Data, Hive, Pig, Kubernetes, and the ETL tools Informatica PowerCenter and Informatica Cloud (using Salesforce for customer data), as well as data modeling, data warehousing, and ETL data integration.
- Good knowledge of the Software Development Life Cycle (SDLC) and Software Testing Life Cycle (STLC) in Agile Scrum, Waterfall, and V-Model environments.
- Implemented Agile Methodology for building an internal application.
- Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX, and Spark Streaming for processing and transforming complex data using in-memory computing, with code written in Scala.
- Worked with Spark to improve the efficiency of existing algorithms using SparkContext, Spark SQL, Spark MLlib, pair RDDs, and Spark on YARN.
- Expertise in using major Hadoop ecosystem components like HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, and Hue.
- Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
- Used IDEs and editors like Eclipse, IntelliJ IDEA, PyCharm, Notepad++, and Visual Studio for development.
- Hands-on experience in data modeling and dimensional modeling using Kimball methodologies.
- Experience working with NoSQL databases like Cassandra and HBase, including developing real-time read/write access to very large datasets via HBase.
- Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality.
- Expertise in designing complex mappings, performance tuning, and working with slowly changing dimension tables and fact tables.
- Extensively worked with Teradata utilities like FastExport and MultiLoad to export and load data to and from different source systems, including flat files.
- Expertise working with AWS cloud services like EMR, S3, Redshift, CloudWatch, Auto Scaling, DynamoDB, and Route 53 for big data development.
- Hands-on experience in developing and deploying enterprise-based applications using major Hadoop ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX, Spark SQL, Kafka.
- Experience in integrating various data sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
- Experience in Extraction, Transformation and Loading (ETL) of data from various sources into data warehouses, as well as data processing such as collecting, aggregating, and moving data from various sources using Apache Flume, Kafka, and Power BI.
- Hands-on experience with Hadoop architecture and its components, such as the Hadoop Distributed File System (HDFS), JobTracker, TaskTracker, NameNode, and DataNode, and with Hadoop MapReduce programming.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Comprehensive experience in developing simple to complex MapReduce and streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation. Also possess detailed knowledge of the MapReduce framework.
- Adept at configuring and installing Hadoop/Spark Ecosystem Components.
- Experienced in building automated regression scripts for validation of ETL processes between multiple databases like Oracle, SQL Server, Hive, and MongoDB using Python.
- Experience in extracting files from MongoDB through Sqoop and placing them in HDFS.
- Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Expertise in Python and Scala, and in writing PySpark/Spark user-defined functions (UDFs) for Hive and Pig using Python.
- Proficiency in SQL across several dialects (commonly MySQL, PostgreSQL, Redshift, SQL Server, and Oracle).
- Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
- Hands-on experience with Spark MLlib utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Good knowledge of writing MapReduce jobs through Pig, Hive, and Sqoop.
- Experience in working with Flume and NiFi for loading log files into Hadoop.
- Capable of processing large datasets (gigabytes) of structured, semi-structured, or unstructured data.
- Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
- Experience working with GitHub/Git 2.12 source and version control systems.
- Strong interpersonal skills and the ability to work independently or in a group; a quick learner who adapts easily to the working environment.
TECHNICAL SKILLS
Big Data Tools: Hadoop Ecosystem, MapReduce, Spark 2.3, Spark 3.1, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
Data Modeling Tools: Erwin Data Modeler, ER Studio v17
Programming Languages: SQL, PL/SQL, and UNIX.
Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
Cloud Platform: AWS, Azure, Google Cloud.
Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena; Azure Services - Azure Data Factory, Azure Data Lake, Azure Databricks
RDBMS: Oracle 12c/11g/10g, MySQL, SQL Server
NoSQL Databases: MongoDB, Cassandra, HBase
OLAP Tools: Tableau, SSAS, Business Objects and Crystal Reports 9
ETL/Data warehouse Tools: Informatica 9.6/9.1 and Tableau.
Operating System: Windows, Unix, Sun Solaris
PROFESSIONAL EXPERIENCE
Confidential, Detroit, MI
Sr. Big Data Engineer
Responsibilities:
- Developed Apache Spark applications using Scala for data processing from various streaming sources.
- Configured Spark Streaming to receive real-time data from Apache Kafka and store the stream data in DynamoDB using Scala.
- Compiled and validated data from all departments and presented it to the Director of Operations.
- Used AWS EMR to transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.
- Developed a detailed project plan and helped manage the data conversion and migration from the legacy system to the target Snowflake database.
- Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Utilized AWS services focused on big data architecture, analytics, and business intelligence solutions to ensure optimal design, scalability, flexibility, availability, and performance, and to provide relevant and valuable data for improved decision-making.
- Used AWS Redshift to extract, transform, and load data from various heterogeneous data sources and destinations.
- Responsible for the design, development, and testing of the database; developed stored procedures, views, and triggers.
- Created, modified, and executed DDL for AWS Redshift and Snowflake tables to load data.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed PySpark/Spark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
- Developed a data pipeline using PySpark/Spark, Hive, Pig, Python, Impala, and HBase to ingest customer
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Selected and exported data to CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Expertise in analyzing data using Pig scripting, Hive queries, Spark (Python), and Impala.
- Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS CLI.
- Migrated an existing on-premises application to AWS. Used AWS services like EC2 and S3 for processing and storing small data sets, and maintained the Hadoop cluster on AWS EMR.
- Worked on cluster coordination services through Zookeeper.
- Developed workflows in Oozie to manage and schedule jobs on the Hadoop cluster to trigger daily, weekly, and monthly batch cycles.
- Extracted files from Hadoop and dropped them into S3 on a daily/hourly basis. Worked with Data Governance and Data Quality to design various models and processes.
- Configured CloudWatch logs and created a CloudWatch dashboard for monitoring.
- Tuned SQL queries to reduce run time by working on indexes and execution plans.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Built large-scale data processing systems in data warehousing solutions, and worked with unstructured data mining on NoSQL databases like MongoDB, HBase, and Cassandra.
- Performed ETL testing activities such as running jobs, extracting data from the database using the necessary queries, transforming it, and uploading it to the data warehouse servers.
Environment: Apache Spark, Kafka, Scala, AWS, EC2, Redshift, Lambda, DynamoDB, S3 Buckets, CloudWatch, Pig, Impala, Python, Pandas, PySpark, Star, Snowflake, PL/SQL, Tableau, Oracle 12g, SQL Server, Spark SQL, OpenShift, PostgreSQL, Talend, MongoDB, HBase, Cassandra, Zookeeper, Oozie
Confidential, Westlake, TX
Sr. Data Engineer
Responsibilities:
- Achieved near-real-time reporting by adopting an event-based processing approach instead of micro-batching to handle data coming from Kafka.
- Used Scala to convert Hive/SQL queries into RDD transformations in Apache Spark.
- Implemented Spark solutions to generate reports and to fetch and load data in Hive.
- Implemented Spark using Scala and Python, utilizing DataFrames and the Spark SQL API for faster processing of data.
- Responsible for working with various teams on a project to develop an analytics-based solution targeting customer subscribers specifically.
- Decommissioned and added nodes in the clusters for maintenance.
- Monitored cluster health by setting up alerts using Nagios and Ganglia.
- Added new users and user groups as requested by the client.
- Wrote applications that produced data to Kafka and consumed data from it.
- Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats such as text, JSON, and Parquet. Loaded data from the Linux file system into HDFS.
- Implemented Copy activities and custom Azure Data Factory pipeline activities.
- Transformed business problems into Big Data solutions and defined the Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
- Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
- Created a KPI calculator sheet and maintained it within SharePoint.
- Primarily involved in Data Migration using SQL, Azure SQL, Azure Storage, and Azure Data Factory.
- Designed the business requirement collection approach based on the project scope and SDLC methodology.
- Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data to and from sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and a write-back tool.
- Developed various mappings with the collection of all sources, targets, and transformations using Informatica Designer.
- Developed mappings using transformations like Expression, Filter, Joiner, and Lookup for better data massaging and to migrate clean and consistent data.
- Configured Hadoop tools like Hive, Pig, Zookeeper, Flume, Impala and Sqoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both Managed and External tables created by Hive using Impala.
- Performed data integration: ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
- Migrated on-premises data (Oracle, SQL Server, DB2, MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
- Created and published multiple dashboards and reports using Tableau Server, and worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
- Worked with a combination of unstructured and structured data from multiple sources and automated the cleaning using Python scripts.
- Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
- Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
- Developed visualizations and dashboards using Power BI.
- Created Tableau reports with complex calculations and worked on ad-hoc reporting using Power BI.
- Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
- Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big Data technologies.
- Developed Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
- Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
- Unit tested the code and provided feedback to the developers. Performed unit testing of the application using NUnit.
- Developed database applications using SQL and PL/SQL.
- Created and maintained SQL Server scheduled jobs, executing stored procedures to extract data from Oracle into SQL Server. Extensively used Tableau for customer marketing data visualization.
- Analyzed the Hadoop cluster and different big data analytics tools, including Pig and Hive.
- Worked on data streaming processes with Kafka, Apache Spark, and Hive.
- Worked with various HDFS file formats like Avro, SequenceFile, NiFi, and JSON, and various compression formats like Snappy and bzip2.
- Built a dashboard of all the YARN applications running on the cluster using the YARN API.
- Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python, Jupyter Notebook, Hive, and NoSQL.
- Performed all necessary day-to-day Git support for different projects; responsible for the design and maintenance of the Git repositories and access control strategies.
Environment: Hadoop, Hive, Pig, Spark, Zookeeper, Kafka, Flume, Impala, Sqoop, Azure, Azure Data Factory, Azure Databricks, HDInsight, Azure Data Lake, Power BI, PL/SQL, Oracle 11g, SQL Server, DB2, MongoDB, Python, YARN, Git.
Confidential, Los Angeles, CA
Big Data Engineer
Responsibilities:
- Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
- Imported Legacy data from SQL Server and Teradata into Amazon S3.
- Created consumption views on top of metrics to reduce the running time for complex queries.
- Exported data into Snowflake by creating staging tables to load data from different files in Amazon S3.
- Compared data at the leaf level across various databases whenever data transformation or data loading took place.
- Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
- Closely involved in scheduling daily and monthly jobs with preconditions and postconditions based on requirements.
- Monitored daily, weekly, and monthly jobs and provided support in case of failures or issues.
- Analyzed the Hadoop cluster and different big data analytics tools.
- Worked on data streaming processes with Kafka, Apache Spark, and Hive.
- Worked with various HDFS file formats like Avro, SequenceFile, NiFi, and JSON, and various compression formats like Snappy and bzip2.
- Developed and configured Kafka brokers to pipeline server log data into Spark Streaming.
- Developed Spark scripts using Scala shell commands as per requirements.
- Imported data from Cassandra databases and stored it in AWS.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs.
- Used the Amazon CLI for data transfers to and from Amazon S3 buckets.
- Executed Hadoop/Spark jobs on AWS EMR using programs and data stored in S3 buckets.
- Implemented workflows using the Apache Oozie framework to automate tasks.
- Implemented Spark RDD transformations and actions to carry out business analysis.
- Devised PL/SQL stored procedures, functions, triggers, views, and packages. Made use of indexing, aggregation, and materialized views to optimize query performance.
- Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts.
- Used Spark Streaming APIs to perform necessary transformations and actions on data received from Kafka.
- Developed Spark code using Scala and Spark SQL/Streaming for faster testing and processing of data.
Environment: HDFS, MapReduce, Snowflake, Pig, Hive, Kafka, Spark, PL/SQL, AWS, S3 Buckets, Scala, SQL Server, Cassandra, Oozie.
Confidential
Data & Reporting Analyst
Responsibilities:
- Researched and recommended a suitable technology stack for Hadoop migration considering the current enterprise architecture.
- Responsible for building scalable distributed data solutions using Hadoop.
- Selected and exported data to CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
- Collected and aggregated large amounts of log data and staged it in HDFS for further analysis.
- Managed and reviewed Hadoop log files.
- Used Sqoop to transfer data between relational databases and Hadoop.
- Worked on HDFS to store and access huge datasets within Hadoop.
- Wrote Hive Queries for analyzing data in Hive warehouse using Hive Query Language (HQL).
- Queried both managed and external tables created by Hive using Impala. Developed different kinds of custom filters and handled predefined filters on HBase data using the API.
- Imported data from different data sources into HDFS using Sqoop, performed transformations using Hive, and then loaded the data into HDFS.
- Good hands-on experience with GitHub.
Environment: Hive, Python, HDFS, Tableau, HBase, MySQL, Impala, AWS, Redshift, GitHub.
Confidential
Data Analyst
Responsibilities:
- Extracted, manipulated, and analyzed data and created reports using T-SQL.
- Set up pivot tables in Excel to create multiple reports based on data from a SQL query.
- Involved in requirement gathering, analysis, documentation, follow-ups, reporting and coordination between the business owners and technical team.
- Developed stored procedures and SQL scripts for performing automation.
- Validated the data by using SQL queries extensively.
- Worked on the ETL process to clean and load large volumes of data extracted from several websites (JSON/CSV files) into SQL Server.
- Performed data profiling, data pipelining, and data mining; validated and analyzed data (exploratory and statistical analysis); and generated reports.
- Used Microsoft SSIS and Informatica for extracting, transforming, and loading (ETL) data from spreadsheets, database tables, and other sources.
Environment: T-SQL, MS Excel, MS SQL Server, PowerPoint, Microsoft SSIS, Informatica.
