Senior Big Data Engineer Resume
Weehawken, NJ
SUMMARY
- 8+ years of strong experience across the Software Development Life Cycle (SDLC), including requirements analysis, design specification, and testing, in both Waterfall and Agile methodologies.
- Very strong interpersonal skills and the ability to work both independently and in a group; quick learner who adapts easily to new working environments.
- Experience in Big Data Analytics using HDFS, YARN, MapReduce, Hive, Impala, Pig, Sqoop, HBase, Spark, Spark SQL, Kafka, Spark Streaming, Flume, Oozie, Zookeeper, Hue.
- Experience in installing, configuring, and administering Hadoop clusters across major Hadoop distributions.
- Experience in the development, implementation, and testing of Business Intelligence and Data Warehousing solutions.
- Proficient in data processing tasks such as collecting, aggregating, and moving data from various sources using Apache Flume and Kafka.
- Developed dataset processes for data modeling and data mining; recommended ways to improve data reliability, efficiency, and quality.
- Experience in importing and exporting data between HDFS and relational database systems using Sqoop and loading it into partitioned Hive tables.
- Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
- Experience in developing MapReduce programs using Apache Hadoop to analyze big data per requirements.
- Expertise in the Amazon Web Services (AWS) cloud platform, including services such as EC2, S3, and Redshift.
- Experience in designing star and snowflake schemas for data warehouse and ODS architectures.
- Experience in developing custom UDFs in Python to extend Hive and Pig Latin functionality (a minimal sketch appears at the end of this summary).
- Expertise in designing complex mappings, with further expertise in performance tuning and in building slowly changing dimension tables and fact tables.
- Experience in Data Analysis, Data Profiling, Data Integration, Migration, Data governance and Metadata Management, Master Data Management and Configuration Management.
- Hands-on with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
- Skilled in System Analysis, E-R/Dimensional Data Modeling, Database Design and implementing RDBMS specific features.
- Experience in migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Experience in developing a data pipeline through Kafka-Spark API.
- Experienced in building automated regression scripts in Python to validate ETL processes across multiple databases such as Oracle, SQL Server, Hive, and MongoDB.
- Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
- Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
- Experienced in developing and supporting Oracle SQL, PL/SQL, and T-SQL queries.
- Experience in working with Excel Pivot and VBA macros for various business scenarios.
- Experience in data manipulation, data analysis, and data visualization of structured data, semi-structured data, and unstructured data
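Below is a minimal sketch of the Python-for-Hive UDF work mentioned above, implemented as a Hive streaming script invoked through TRANSFORM. The script name, table, and columns (user_id, email) are hypothetical placeholders, not details from an actual project.

```python
#!/usr/bin/env python
# Minimal Hive "streaming UDF" sketch: normalizes an email column.
# Invoked from HiveQL via TRANSFORM, for example:
#   ADD FILE clean_email.py;
#   SELECT TRANSFORM(user_id, email)
#     USING 'python clean_email.py'
#     AS (user_id STRING, email STRING)
#   FROM raw_users;
# The table and column names are hypothetical.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")   # Hive streams rows as tab-separated text
    if len(fields) < 2:
        continue                             # skip malformed rows
    user_id, email = fields[0], fields[1]
    print("\t".join([user_id, email.strip().lower()]))
```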
TECHNICAL SKILLS
Big Data Tools: Kafka, Cassandra, Apache Spark, Spark Streaming, HBase, Impala, HDFS, MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Zookeeper
Languages: PL/SQL, SQL, Scala, Python, PySpark, Java, C, C++, Shell script, Perl script
BI Tools: SSIS, SSRS, SSAS.
Modeling Tools: IBM Infosphere, SQL Power Architect, Oracle Designer, Erwin 9.6/9.5, ER/Studio 9.7, Sybase Power Designer.
Cloud Technologies: AWS and Azure (Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse).
Database Tools: Oracle, MySQL, Microsoft SQL Server, Teradata, MongoDB, Cassandra, HBase
ETL Tools: Pentaho, Informatica Power 9.6, SAP Business Objects XIR3.1/XIR2, Web Intelligence.
Reporting Tools: Business Objects, Crystal Reports.
Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.
Operating Systems: Windows, DOS, Unix, Linux.
Other Tools: TOAD, SQL*Plus, SQL*Loader, MS Project, MS Visio, MS Office; have also worked with C++, UNIX, and PL/SQL.
PROFESSIONAL EXPERIENCE
Confidential, Weehawken, NJ
Senior Big Data Engineer
Responsibilities:
- Implemented Kafka producers with custom partitioning, configured brokers, and implemented high-level consumers for the data platform.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Used Airflow to schedule and monitor the work.
- Developed various Oracle SQL scripts, PL/SQL packages, procedures, functions, and Java code for data processing.
- Worked on a clustered Hadoop for Windows Azure using HDInsight and Hortonworks Data Platform for Windows.
- Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
- Configured Spark Streaming to receive ongoing data from Kafka and store the streamed data in HDFS.
- Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds; implemented data quality checks in the Talend ETL tool, with good knowledge of data warehousing; developed Apache Spark applications for data processing from various streaming sources.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, DevOps practices, and Spark on YARN.
- Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (see the sketch at the end of this role).
- Exposed transformed data on the Azure Databricks Spark platform in Parquet format for efficient data storage.
- Created Data Factory pipelines that bulk copy multiple tables at once from relational databases to Azure Data Lake Gen2.
- Created a custom logging framework for ELT pipeline logging using Append Variable activities in Data Factory.
- Designed and developed data integration and migration solutions in Azure.
- Utilized Ansible playbooks for code pipeline deployment.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed dataset join scripts using Hive join operations.
- Created self-service reporting in Azure Data Lake Store Gen2 using an ELT approach.
- Built a real-time pipeline for streaming data using Event Hubs/Microsoft Azure Queue and Spark Streaming.
- Worked with NoSQL databases such as HBase, creating HBase tables to load large sets of semi-structured data coming from various sources.
Environment: Hadoop, Spark, MapReduce, Kafka, Scala, Java, Azure Data Lake Gen2, Azure Data Factory, PySpark, Databricks, Azure DevOps, Agile, Power BI, Python, R, PL/SQL, Oracle 12c, SQL, NoSQL, HBase, Scaled Agile team environment
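The sketch below is a minimal PySpark Structured Streaming illustration of the Kafka-to-Delta flow described in this role. The broker address, topic name, event schema, and storage paths are hypothetical placeholders, not details from the original project.

```python
# Minimal PySpark Structured Streaming sketch (Databricks-style): reads JSON events
# from a Kafka topic and appends them to a Delta table for ACID-logged storage.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

event_schema = StructType([
    StructField("device_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
       .option("subscribe", "iot-events")                   # placeholder topic
       .option("startingOffsets", "latest")
       .load())

# Parse the Kafka message value (bytes) into typed columns.
events = (raw.select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

(events.writeStream
 .format("delta")                                           # Delta Lake sink
 .option("checkpointLocation", "/mnt/checkpoints/iot")      # placeholder path
 .outputMode("append")
 .start("/mnt/delta/iot_events"))                           # placeholder table path
```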
Confidential, Columbus, OH
Big Data Engineer
Responsibilities:
- Performed real-time stream processing through the data lake using HDP, HDF, NiFi, and PySpark.
- Used Spark and Spark SQL to read Parquet data and create the tables in Hive using the Scala API.
- Experienced in using the Spark application master to monitor Spark jobs and capture their logs.
- Worked on Apache NiFi: executed Spark and Sqoop scripts through NiFi, created scatter-and-gather patterns in NiFi, ingested data from Postgres to HDFS, fetched Hive metadata and stored it in HDFS, and created a custom NiFi processor for filtering text from FlowFiles.
- Designed and performed data transformations using data mapping and data processing capabilities such as Spark SQL, PySpark, Python, and Scala.
- Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
- Worked in an AWS environment on the development and deployment of custom Hadoop applications.
- Strong experience working with Elastic MapReduce (EMR) and setting up environments on Amazon AWS EC2 instances.
- Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, DataFrames, and pair RDDs.
- Configured Spark Streaming to receive ongoing data from Kafka and store the streamed data in HDFS.
- Developed end-to-end data processing pipelines that begin by receiving data through the Kafka distributed messaging system and persist the data into Cassandra.
- Developed multiple Kafka producers and consumers as per the software requirement specifications.
- Used Spark Streaming APIs to perform transformations and actions on the fly for building the common learner data model, which gets data from Kafka in near real time and persists it to Cassandra (see the sketch at the end of this role).
- Used Kafka and Kafka brokers, initiated the Spark context, processed live streaming information with RDDs, and used Kafka to load data into HDFS and NoSQL databases.
- Implemented a data interface to retrieve customer information using REST APIs, pre-processed the data using MapReduce, and stored it in HDFS (Hortonworks).
- Created PySpark data frames to bring data from DB2 to Amazon S3.
- Performed processing of various data sets (ORC, Parquet, Avro) using PySpark.
- Collected data from an AWS S3 bucket using Spark Streaming in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
- Extracted real-time feeds using Kafka and Spark Streaming, converted them to RDDs, processed the data as DataFrames, and saved it in Parquet format in HDFS.
- Optimized PySpark jobs to run on a Kubernetes cluster for faster data processing.
- Developed Hive queries for analysts by loading and transforming large sets of structured and semi-structured data using Hive.
- Used DataStage as an ETL tool to extract data from source systems and load it into the Oracle database.
- Developed automated regression scripts in Python to validate ETL processes across multiple databases such as AWS Redshift, Oracle, MongoDB, and SQL Server (T-SQL).
- Experience in building real-time data pipelines with Kafka Connect and Spark Streaming.
- Developed shell scripts to generate the Hive CREATE statements from the data and load the data into the tables.
- Involved in writing custom MapReduce programs using the Java API for data processing.
- Implemented error handling in DataStage and designed error jobs to notify users and update the log table.
Environment: Hadoop, HDP, Hive, Cassandra, MapReduce, Apache NiFi, Zookeeper, Airflow, Scala, Kafka, AWS, EC2, S3, Redshift, PySpark, Kubernetes, Oracle 12c, T-SQL, MongoDB, HBase, Sqoop, Java, Python, Spark, Spark SQL, Spark Streaming.
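The sketch below illustrates the Kafka to Spark Streaming to Cassandra flow described in this role, using Structured Streaming with foreachBatch and the DataStax spark-cassandra-connector. The broker, topic, keyspace, table, and columns are hypothetical, and the connector package is assumed to be supplied on the classpath (for example via --packages).

```python
# Minimal PySpark sketch: consume Kafka events and persist each micro-batch to Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = (SparkSession.builder
         .appName("kafka-to-cassandra")
         .config("spark.cassandra.connection.host", "cassandra-host")  # placeholder host
         .getOrCreate())

schema = StructType([
    StructField("learner_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "learner-events")              # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

def write_to_cassandra(batch_df, batch_id):
    # Each micro-batch is appended to a Cassandra table through the connector.
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="learning", table="learner_events")    # placeholder keyspace/table
     .mode("append")
     .save())

events.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()
```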
Confidential, Chesapeake, VA
Big Data Engineer
Responsibilities:
- Developed PIG UDFs for manipulating the data according to Business Requirements and also worked on developing custom PIG Loaders.
- Developed ETL pipelines into and out of the data warehouse using a combination of Python and SQL queries.
- Worked on the implementation of a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
- Used Hive to analyze data ingested into HBase by using Hive-HBase integration and compute various metrics for reporting on the dashboard.
- Worked with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HQL queries.
- Developed Spark scripts and UDFs using both the Spark DSL and Spark SQL queries for data aggregation and querying, and wrote data back into the RDBMS through Sqoop.
- Wrote Oozie scripts and set up workflows using the Apache Oozie workflow engine for managing and scheduling Hadoop jobs.
- Developed Spark Streaming programs to process near-real-time data from Kafka, processing the data with both stateless and stateful transformations.
- Built and implemented automated procedures to split large files into smaller batches of data to facilitate FTP transfer, which reduced execution time by 60%.
- Worked on developing ETL processes (Data Stage Open Studio) to load data from multiple data sources to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
- Experience in report writing using SQL Server Reporting Services (SSRS) and creating various types of reports like drill down, Parameterized, Cascading, Conditional, Table, Matrix, Chart and Sub Reports.
- Used the DataStax Spark connector to store data into and retrieve data from the Cassandra database.
- Transformed the data using AWS Glue DynamicFrames with PySpark; cataloged the transformed data using crawlers and scheduled the job and crawler using the Glue workflow feature (see the sketch at the end of this role).
- Developed data pipeline programs with the Spark Scala APIs, performed data aggregations with Hive, and formatted data (JSON) for visualization.
Environment: Apache Spark, MapReduce, Apache Pig, Python, Java, SSRS, HBase, AWS, Cassandra, PySpark, Apache Kafka, Hive, Sqoop, Flume, Apache Oozie, Zookeeper, ETL, UDF.
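The sketch below is a minimal AWS Glue job (PySpark) illustrating the DynamicFrame transformation and catalog usage described in this role. The database, table, column names, and S3 path are hypothetical placeholders.

```python
# Minimal AWS Glue job sketch: read a cataloged table as a DynamicFrame,
# apply a column mapping, and write Parquet output to S3.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Source table registered in the Glue Data Catalog by a crawler (placeholder names).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders")

# Rename/cast columns with ApplyMapping before writing out.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "double", "order_amount", "double")])

glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},  # placeholder path
    format="parquet")

job.commit()
```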
Confidential
Data Engineer
Responsibilities:
- Worked on different Data Flow and Control Flow tasks: For Loop containers, Sequence containers, Script tasks, Execute SQL tasks, and package configurations.
- Created new procedures to handle complex business logic and modified existing stored procedures, functions, views, and tables for new project enhancements and to resolve existing defects.
- Created batch jobs and configuration files to build automated processes using SSIS.
- Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.
- Created Hive tables and worked on them using HiveQL (see the sketch at the end of this role); experienced in defining job flows.
- Imported and exported data between the Oracle database and HDFS using Sqoop.
- Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs. Developed a custom file system plugin for Hadoop so it can access files on the data platform.
- Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Made extensive use of Expressions, Variables, and Row Count in SSIS packages.
- Created SSIS packages to pull data from SQL Server and export it to Excel spreadsheets, and vice versa.
- Deployed and scheduled SSRS reports to generate daily, weekly, monthly, and quarterly reports.
Environment: Hadoop, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, SSRS, Report Builder, Office, Excel, Flat Files, T-SQL.
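The sketch below illustrates the partitioned Hive table work referenced in this role. The HiveQL statements are issued through a Hive-enabled SparkSession here only to keep the example self-contained; the same statements could be run from the Hive CLI. The database, table, column names, and partition value are hypothetical.

```python
# Illustrative sketch: create a partitioned Hive table and load one partition.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioned-load")
         .enableHiveSupport()
         .getOrCreate())

# Partitioned Hive table for data landed in HDFS (e.g. via a Sqoop import into staging).
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_db.orders (
        order_id    STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Load one day's staged data into the matching static partition.
spark.sql("""
    INSERT OVERWRITE TABLE sales_db.orders PARTITION (load_date = '2020-01-15')
    SELECT order_id, customer_id, amount
    FROM sales_db.orders_staging
""")
```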
Confidential
Hadoop Engineer
Responsibilities:
- Installed and configured Apache Hadoop to test the maintenance of log files in the Hadoop cluster.
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Developed Hive queries to process the data for visualization.
- Installed the Oozie workflow engine to run multiple Hive and Pig jobs.
- Developed and worked on industry-specific UDFs (user-defined functions).
- Created Hive tables and was involved in data loading and writing Hive UDFs.
- Used Sqoop to import data into HDFS and Hive from other data systems.
- Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
- Developed Pig Latin scripts for the analysis of semi-structured data.
- Developed Java MapReduce programs for the analysis of sample log files stored in the cluster (an illustrative sketch follows this role).
- Developed simple to complex MapReduce jobs using Hive and Pig.
- Set up and benchmarked Hadoop/HBase clusters for internal use.
- Hands-on migration of existing on-premises Hive code to GCP (Google Cloud Platform).
- Set up GCP firewall rules to allow or deny traffic to and from VM instances based on specified configurations.
- Migrated ETL processes from RDBMS to Hive to test easier data manipulation.
Environment: Apache Hadoop, HDFS, Cloudera Manager, Java, MapReduce, Eclipse, Hive, Pig, Sqoop, CentOS, Oozie, and SQL.
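A rough illustration of the log-analysis MapReduce logic described in this role, written here as a Hadoop Streaming job in Python rather than the Java MapReduce actually used, to stay in one language with the other sketches. The log format (level as the third whitespace-separated field) and all paths are assumptions.

```python
#!/usr/bin/env python
# Hadoop Streaming sketch: counts log lines per log level.
# Example invocation (paths are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /logs/sample -output /logs/level_counts \
#     -mapper "python log_levels.py map" -reducer "python log_levels.py reduce" \
#     -file log_levels.py
import sys

def mapper():
    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 3:
            print("%s\t1" % parts[2])          # emit (log_level, 1)

def reducer():
    # Input arrives sorted by key, so counts can be accumulated per run of keys.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key == current:
            count += int(value)
        else:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = key, int(value)
    if current is not None:
        print("%s\t%d" % (current, count))     # flush the last key

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```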